ETL tools get their name from the operations they are intended to perform: Extract, Transform and Load. They make our lives easier by automating much of the work involved in building and running data warehouses. Several such products exist in the market: Talend is an open-source option, while Informatica and IBM DataStage are among the market leaders.
Now, what is the misunderstanding here?
- Extract – this operation reads data from the various source systems.
- Transform – this operation happens within the ETL tool itself.
- Load – obviously, this writes the data into the target data warehouse.
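As a minimal sketch of that intended flow, the snippet below uses in-memory SQLite databases and made-up table names (`orders`, `fact_orders`, a toy currency conversion) as stand-ins for a real source system, ETL tool, and warehouse:

```python
import sqlite3

# Hypothetical stand-ins for a real source system and a data warehouse.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
warehouse.execute("CREATE TABLE fact_orders (id INTEGER, amount_usd REAL)")

# Extract: read from the source system.
rows = source.execute("SELECT id, amount FROM orders").fetchall()

# Transform: apply logic inside the ETL tool (here, an invented conversion rate).
transformed = [(oid, round(amount * 1.1, 2)) for oid, amount in rows]

# Load: write the transformed rows into the target data warehouse.
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?)", transformed)
```

Only the extracted source rows travel through the tool; the warehouse's existing contents are never touched.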
But I have seen some systems do the following:
- extract the existing data from the target data warehouse (not from the source)
- add the daily incremental data
- do some transformation
- truncate the existing data
- load the new data back to the target
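Sketching those steps in the same spirit (again with a toy SQLite warehouse and invented table names), the round trip looks like this:

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_orders (id INTEGER, amount_usd REAL)")
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?)",
                      [(1, 11.0), (2, 22.0)])  # existing history

incremental = [(3, 30.0)]  # today's new source rows, already extracted

# 1. Extract the EXISTING data back out of the target warehouse.
existing = warehouse.execute("SELECT id, amount_usd FROM fact_orders").fetchall()

# 2. Add the daily incremental data, transforming it in the tool.
combined = existing + [(oid, round(amount * 1.1, 2)) for oid, amount in incremental]

# 3. Truncate the existing data.
warehouse.execute("DELETE FROM fact_orders")

# 4. Load everything back, including rows the warehouse already had.
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?)", combined)
```

Every run moves the full table out of the warehouse and back in, even though only the rows in `incremental` are actually new.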
What is wrong here?
So many gigabytes of data are extracted from and loaded back into the data warehouse. Do you see the huge number of disk I/Os here? If we load only the incremental data and process it within the data warehouse, or better, within the database itself, we can save the time spent on those disk input/output operations. We do not even need the ETL tool for this step: the database engine can apply the transformation logic itself. A database should not be used just for storing data. It has powerful processing engines, namely SQL and stored procedures, and we should know how to use them correctly.
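A minimal sketch of that alternative, assuming the same toy schema as before: load only the incremental rows into a staging table, then let the database engine apply the transformation with a single SQL statement, so the existing history never leaves the warehouse:

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_orders (id INTEGER, amount_usd REAL)")
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?)",
                      [(1, 11.0), (2, 22.0)])  # existing history stays put
warehouse.execute("CREATE TABLE stage_orders (id INTEGER, amount REAL)")

# Load ONLY the incremental data into a staging table.
warehouse.executemany("INSERT INTO stage_orders VALUES (?, ?)", [(3, 30.0)])

# Transform inside the database engine with plain SQL: no existing rows
# are extracted, truncated, or reloaded.
warehouse.execute("""
    INSERT INTO fact_orders (id, amount_usd)
    SELECT id, ROUND(amount * 1.1, 2) FROM stage_orders
""")
warehouse.execute("DELETE FROM stage_orders")  # clear staging for the next run
```

The disk I/O is now proportional to the size of the daily increment, not to the size of the whole warehouse table.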
I wonder what made them choose such a design approach here.