CTS – Your Technology Partner

Data Discovery Enablement

Written by Matt Jones on January 30, 2014

“Our $$$$$$$$$$$ Data Warehouse sucks and it does not have the information I need to do my job!”

“It will take HOW long to add that information to my report?”

Unfortunately, today these are sentiments expressed or felt by many business users across the world. For years, IT has been getting a black eye for building large data warehouses that end users are not happy with. In fact, a number of customers lately have told me not to even use the word Data Warehouse in conversation or decision makers will stop paying attention. So what is the real issue here? Are data warehouses giant money sucking endeavors that are rarely successful? I think DEFINITELY NOT. When I think about the real issue I can’t stop hearing The Animals singing:

Baby, do you understand me now?

Sometimes I feel a little mad.

But don’t you know that no one alive can always be an angel?

When things go wrong, I seem to be bad.

But I’m just a soul whose intentions are good:

Oh Lord! Please don’t let me be misunderstood …

To understand the need for data warehouses you really have to consider that requirements for data will have varying levels of necessary data quality and uniqueness. End users, will trade quality and uniqueness for a reduction in time IT requires to expose the data. Consider the following diagram.


With the evolution of Data Warehousing, organizations are making large investments to get Gold Standard information from the investments. In the industry, we see all kinds of initiatives related to data governance, data lineage, Master Data Management, data quality, and so on. These investments are necessary and valuable for creating auditable and traceable data that organizations can submit to government and financial agencies, investors and the public. In situation where jail time or fines are a penalty for reporting quality issues, I would definitely want the rigor of a gold standard process.

However, business users generally have misconceptions related to data warehouses. Including:

• Enterprise Data Warehouses contain a majority of the enterprise’s data: Data Warehouses usually have no more than 20% of an enterprises data

• Adding a new field is simple: Adding information to a data warehouse can involve modification / creation of up front ETL, MDM, and data quality scripts and modification of table structure across multiple data layers

Because of these misconceptions, IT is constantly hearing comments like the ones I opened this blog with. However, there is GOOD NEWS! With the advancements in business intelligence technology delivered by “Big Data” tools, IT organizations can finally deliver closer to 100% of the enterprise’s data to business users. HOW?

Utilizing Hadoop, organizations can easily dump structured data into a Hadoop cluster without the pains of traditional ETL processing. Through Hadoop, organizations can expose the raw data with Hive scripts or through using massively parallel processing devices like Microsoft PDW and Oracle Exadata connected to Hadoop. Once exposed, many reporting tools like Microsoft PowerView and PowerBI, Tableau, and MicroStrategy can connect to the Hadoop cluster and enable end users to perform data discovery and reporting on the data. Another interesting approach is in memory data solutions like Microsoft’s SQL Server Analysis Services Tabular Mode, Oracle TimesTen, and SAP Hana to name a few. Utilizing these technologies with Hadoop, companies can physically leave the raw data inside Hadoop and logically group the data in an in memory solution. Most of the technologies allow direct query so that the data is actually queried via the source system and the in memory solution just provides a logical definition that report solutions can traverse to simplify reporting.

The import thing to remember with this architecture that the goal addressed is providing quick access to a much larger percentage of and organizations data assets. This data should be considered RAW and the meaning of this will have to be reinforced with end users. Once a series of information is proven extremely valuable, organizations should incorporate the data in their GOLD STANDARD architecture. I know what you are thinking and YES, all you have to do is point your ETL process to Hadoop as the data source for the data warehouse.