To create a data model for our data warehouse we use tooling supplied by the ERP vendor. This does probably matter due to the fact that is has it’s limitations. We inherited this environment with a certain design. We were new to data warehousing and performing this as only a part of our job, so we had a steap learning curve. 🙂 Our basic design for our data warehouse is like this:
(source) -> (staging table) -> (Persistent Staging Area table) -> (set of views) -> (dimension/fact table)
staging table: has only 1 source table, truncated before load, only the delta of records since yesterday is loaded
Persistent Staging Area table: never truncated, loaded with delta records of staging table. So the result is that records are never deleted, current records are updated based on the natural key.
All dimensions and fact tables are truncated and re-loaded every night. This is possible due to the persistent staging area. No history is required in the dimension or fact tables currently. This is probably designed like this in the past, because you are able to completely rebuild all the dimension and fact tables if you like. It makes changes a little more easy to implement, since you do not have to backup the data every time, etc.
We are re-thinking our data warehouse design, since we have learned a lot in the past years. 🙂 We have ETL performance issues, so we want to look at incrementally loading the dimension and fact tables, but are struggling with the following issue.
Let’s say we cut out the Persistent Staging Area layer, so we only have the staging tables which are loaded with delta records only. We have a view C that combines data from source table A and B. This view C is the source for dimension table D and fact table F. (this is a very simplified example)
Now, a column value for a record in table A changes. This column value is an attribute in the dimension table D. Since the view C is based on 2 staging tables, which are incrementally loaded, we will see this records in view C depending on the join type. Let’s say it is a full outer join. We only see NULL values for table B fields, together with this changed column value. This will enter the dimension table D as NULL values for fields of table B and the field value from table A. This is unwanted of course, since it makes the data inconsistent. At this moment, this problem is solved with using a Persistent Staging Area. With the Persistent Staging Area, the record in there will be updated and propagate correctly to our dimension, since that is reloaded every night. I hope I have explained it clearly.
So we want to look at cutting out the Persistent Staging Area layer, but are not sure how to cope with changes like this. So the scenario where we load only changes to our staging tables and truncate these prior to re-load (to load the new changes). Iam not sure how you would normally solve this. There is probably always some sort of temporary staging required for this between the staging table and the dimension or fact table? Or am I missing something here?
So my question is not about the delta load of the staging tables, I know about CDC, or that trunacting and reloading our dimension and fact tables is bad practice, but Iam probably missing something crucial in how you would bring your data from your staging tables (with delta records only) to your dimension/fact tables (which are combined from many source tables) and only 1 source record changes in a consistent manner. There should be some intermediate staging to make things consistent right?