Why is it that in most projects 70% of the BI/DWH development effort is spent on data integration? What makes data integration so complex? According to datawarhouse guru Ralph Kimbal there are 38 unique subsystems of ETL. After reading the Kimball article, you may conclude it is no wonder that so much effort is spent on this aspect of BI/DWH. However, I’m convinced the effort spent on ETL in BI/DWH development can be reduced drastically by:
1. Focussing on data quality issues at the source instead of dealing with them in ETL
Having just completed an excelelnt training on Total Information Quality Management, I’m even more convinced that a lot of effort spent on data quality issues should be avoided. If the data quality of the source system is not improved, the effort spent here is scrap and rework at best.
So reduce the effort here by spending more time on finding the causes for these data quality issues where they arise (gemba: In quality management gemba means the manufacturing floor and the idea is that if a problem occurs, the engineers must go there to understand the full impact of the problem). Improve processes and use mistake proofing techniques to avoid these. (Poka yoke: a Japanese term that means fail-safing or “mistake-proofing”. A poka-yoke is any mechanism in a manufacturing ocess that helps an equipment operator avoid (yokeru) mistakes (poka). Its purpose is to eliminate product defects by preventing, correcting, or drawing attention to human errors as they occur).
In other words: focus on source data quality improvement so you can reduce the eforts spent on data quality checks and data cleansing.
2. Architecting for change
The datawarehouse and datamarts should be architected in such a way that changes require minimal effort. One elegant way to achieve this this is to apply the Data Vault method in which the datawarehouse is moddeled according to a standard method and data integration complexity is put between the datawarehouse and the (staging out area or datamarts) , allowing for loading one to one data from the source in a largely standardised manner. In this way changes in business requirements have minimal impact on the datawarehouse layer. Be sure to check out the ”The next generation EDW” articles on the articles page for more information.
3. Generate instead of build
I am in strong favour of the use of products that provide metadata driven automation and generation of SQL and DDL such as Kalido DIW and BI-Ready. When using these tools changes are much faster to realize, morover a lot mof time and effort is saved because less coding is required. With these tools data integration is much more a case of modelling, configuring and generating instead of building.
So yes, I’m convinced data integration effort ican be significantly reduced.