For a thousand reasons, data gallops around the corporation like a cackling wicked witch on a Halloween caffeine high. Unintegrated data is everywhere, and it needs to be integrated before it can become useful. If left unintegrated, the data will be interpreted and used improperly.
So a question naturally arises: should the data be integrated at the source, where it is collected and where it resides, or should the data just be collected, loaded as is, and then integrated on the fly as it is used?
This is an architectural question that has many long-term consequences.
Integrating the data as it is collected (typically using some form of ETL) requires a very serious amount of work. The source of the data, what data was included and what data was excluded, and the nature of the algorithms needed for integration are all factors. In short, it takes work and thought and even some guesswork to create integration at the source of data. And for many reasons most people don’t like that four-letter word: work. Most people just want to kick the can down the road and let somebody else do the thinking and working.
So an alternative is just to collect the data in an unintegrated state and do the integration when the data is needed. This is called integration on the fly. This approach makes collecting the data efficient at the expense of using the data.
From the standpoint of avoiding work, integrating data on the fly seems to be a logical choice. But is it really? Is work really avoided?
Let’s consider the amount of work required to integrate data on the fly. Every time the data is accessed, it must be integrated. So would you rather integrate data once, when the data is collected, or integrate repeatedly, every time the data is used? When you look at it that way, it makes no sense to integrate data on the fly. It is much more efficient to integrate the data as it is collected than as it is used. Not integrating data at the source merely postpones and multiplies the work required for integration.
Stated differently, do you want to have to integrate the data once or multiple times?
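To make the once-versus-many-times point concrete, here is a small sketch that counts how often the integration logic runs under each approach. The record layout, the gender-code mapping, and the figure of 100 reads are all made up for illustration; real integration is far more involved than a single lookup.

```python
# Hypothetical illustration: count how often integration logic runs.
calls = {"integrate": 0}

# Assumed encodings for the same attribute across different source systems.
GENDER_MAP = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}

def integrate(record: dict) -> dict:
    """Toy integration step: convert a gender code to one common encoding."""
    calls["integrate"] += 1
    return {**record, "gender": GENDER_MAP[str(record["gender"]).lower()]}

raw = [
    {"id": 1, "gender": "male"},
    {"id": 2, "gender": "F"},
    {"id": 3, "gender": 1},
]
NUM_READS = 100  # assumed number of times the data is accessed

# Integration at the source: each record is integrated once, when loaded.
calls["integrate"] = 0
integrated = [integrate(r) for r in raw]
for _ in range(NUM_READS):
    _ = [r["gender"] for r in integrated]  # reads need no further work
at_source_calls = calls["integrate"]

# Integration on the fly: each record is integrated again on every read.
calls["integrate"] = 0
for _ in range(NUM_READS):
    _ = [integrate(r)["gender"] for r in raw]
on_the_fly_calls = calls["integrate"]

print(at_source_calls, on_the_fly_calls)  # prints "3 300"
```

In this toy case the work on the fly grows with the number of reads, while the work at the source depends only on how much data is collected.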
But there are other considerations.
The consistency of integration is compromised when everyone does their own brand of integration. One person uses one algorithm for integration and another person uses a different one. Each person thinks they are correct, yet their results do not agree. So there is a very big drawback to integrating data on the fly.
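A small sketch shows how two reasonable-looking integration rules can disagree on the same raw data. The customer records and both matching rules below are invented for illustration; neither is presented as the right rule.

```python
# Hypothetical customer rows gathered from several unintegrated systems.
records = [
    {"source": "crm", "name": "Ann Lee", "email": "ann@example.com"},
    {"source": "billing", "name": "Ann Lee", "email": "a.lee@example.com"},
    {"source": "crm", "name": "Bob Roy", "email": "bob@example.com"},
    {"source": "web", "name": "Bob Roy", "email": "bob@example.com"},
]

def count_customers_by_email(rows: list) -> int:
    """Analyst A's rule: two rows are the same customer if the emails match."""
    return len({r["email"].lower() for r in rows})

def count_customers_by_name(rows: list) -> int:
    """Analyst B's rule: two rows are the same customer if the names match."""
    return len({r["name"].lower() for r in rows})

by_email = count_customers_by_email(records)
by_name = count_customers_by_name(records)

print(by_email, by_name)  # prints "3 2"
```

Both analysts answer the question "how many customers do we have?" and both believe their number. Integrating once at the source forces that choice to be made a single time, for everyone.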
Yet another negative associated with integration on the fly occurs when people fail to integrate the data at all. They may not know that integration is needed when they access the data. Or they are in too much of a hurry to do integration. Or they are just plain lazy. The result is that the data they are operating on is incorrect.
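Here is a sketch of what skipping integration can look like in practice. It assumes two made-up source systems that record sales amounts in different units; the field names and figures are hypothetical.

```python
# Hypothetical sales rows: one system reports dollars, the other cents.
sales = [
    {"system": "store", "amount": 120.0, "unit": "dollars"},
    {"system": "web", "amount": 4550, "unit": "cents"},
]

def to_dollars(row: dict) -> float:
    """Toy integration step: express every amount in dollars."""
    if row["unit"] == "cents":
        return row["amount"] / 100
    return float(row["amount"])

# The hurried user adds up the raw column without integrating it.
naive_total = sum(r["amount"] for r in sales)

# The careful user integrates the units first.
integrated_total = sum(to_dollars(r) for r in sales)

print(naive_total, integrated_total)  # prints "4670.0 165.5"
```

Nothing fails and no warning appears. The unintegrated total simply comes out wrong, and whoever uses it may never know.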
So, can you do integration of data on the fly? Yes, of course you can. But:
– In the long run it costs more, much more, and doesn’t save anything.
– It risks having data integrated inconsistently.
– It risks not having the data integrated at all.
From an architectural perspective, integration on the fly is a very poor choice.
(Note: the notion of integration on the fly is not a new concept. It started with something called the “virtual warehouse”. Today there are vendors that try to sell integration on the fly, such as Incorta and others.)
Bill Inmon lives in Denver with his wife and his two Scotty dogs – Jeb and Lena. Jeb has taught Bill many tricks – feed me, open the door, clean up after me. Lena – his little sister – is learning how to teach Bill tricks as well. Lena learns fast.