CTS – Your Technology Partner

Extracting Data Shouldn’t Be Painful

Written by Matt Jones on August 4, 2015

For many IT professionals, this 1823 illustration by Louis Boilly perfectly captures our feelings toward data extraction.

[Image: data extraction]

In my last post, The Times They Are a Changin’, I stated that the current Question First process for business intelligence evolved out of a time when data extraction was difficult and time-consuming. I also talked about how today’s “Big Data” technologies like Hadoop are making data extraction much less painful.

Data extraction would not be much to talk about if loading were not included. You have probably heard that “Big Data” technologies are shaking up our entire perspective on extracting and loading data.

The following picture sums up the shakeup.

[Image: old school vs. new school]

In the past, storing data for processing meant developing a database to hold it. Developing a database meant developing a new data model. Loading required transformation and storage, and transformation meant $$$$.

Today, with the advancements in Hadoop, data can be stored as-is on what amounts to a file system. You have to transform it only when you want to extract parts of the data, and even then the transformation may be logical rather than physical.

The nice thing about Hadoop is how many methods it has for ingesting data and how little code those methods take to implement. I like to break Hadoop data ingestion into three categories: file system, relational database, and streaming. The following diagram provides some details on the techniques in each category.

[Image: illustration for data extraction]

The nice part is that each of these methods is straightforward to use. Hadoop provides a REST API that gives applications access to the file system, and your admins can interact with Hadoop just as they would with a standard folder.
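As a rough sketch of what application access over REST might look like, the snippet below builds WebHDFS-style URLs in Python. It assumes the REST API in question is WebHDFS. The host name, port, and paths are placeholders invented for illustration, and the helper `webhdfs_url` is hypothetical, not part of any Hadoop client library. It only formats URLs and makes no network call.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, hdfs_path, op, **params):
    """Build a WebHDFS-style URL for one file system operation.

    host, port, and hdfs_path are supplied by the caller; this
    hypothetical helper only formats the URL.
    """
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be absolute, e.g. /data/raw")
    # Put the operation first, then any extra query parameters.
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Placeholder cluster address and paths, for illustration only.
list_url = webhdfs_url("namenode.example.com", 9870, "/data/raw", "LISTSTATUS")
create_url = webhdfs_url(
    "namenode.example.com", 9870, "/data/raw/events.json", "CREATE",
    overwrite="true",
)
print(list_url)
print(create_url)
```

An HTTP GET on the first URL would list the directory. Writing a file with CREATE is a two-step PUT in WebHDFS: the NameNode answers with a redirect to a DataNode, and the file contents go to that second address.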

With ODBC drivers, your traditional SQL developers can push structured data directly into Hadoop using tried-and-true techniques. If they are feeling lazy, they can write Sqoop commands to dump the data directly into Hadoop in fewer steps than the SQL approach takes. Finally, you can connect streaming data of all types directly to Hadoop by adding a few lines to a configuration file.
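To give a feel for how short a Sqoop import can be, here is a minimal Python sketch that assembles a basic `sqoop import` command line. The JDBC URL, table name, user name, and target directory are made-up placeholders, and `sqoop_import_command` is a hypothetical helper written for this post, not part of Sqoop. The flags themselves (`--connect`, `--table`, `--username`, `-P`, `--target-dir`, `--num-mappers`) are standard Sqoop import options.

```python
import shlex


def sqoop_import_command(jdbc_url, table, target_dir, username, mappers=1):
    """Return the argument list for a basic `sqoop import` invocation."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,          # JDBC connection string of the source database
        "--table", table,               # source table to copy
        "--username", username,
        "-P",                           # prompt for the password at run time
        "--target-dir", target_dir,     # destination directory in HDFS
        "--num-mappers", str(mappers),  # number of parallel map tasks
    ]


# Placeholder database, table, and HDFS path, for illustration only.
cmd = sqoop_import_command(
    "jdbc:mysql://db.example.com/sales", "orders", "/data/raw/orders", "etl_user"
)
print(shlex.join(cmd))
```

On a machine where Sqoop is installed and the database is reachable, the same list could be handed to `subprocess.run(cmd, check=True)` to run the import.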

Now that I have shown you how straightforward extracting data is in Hadoop, join me next time as I show you how extracting data can rely on the same skills your current staff possess.