Consistent data in dimension and fact tables from multiple incrementally loaded staging tables

To create the data model for our data warehouse we use tooling supplied by the ERP vendor. This probably matters, because the tooling has its limitations. We inherited this environment with a certain design, and since we were new to data warehousing and doing it as only part of our job, we had a steep learning curve. 🙂 Our basic design for the data warehouse looks like this:

(source) -> (staging table) -> (Persistent Staging Area table) -> (set of views) -> (dimension/fact table)

Staging table: has only one source table, is truncated before each load, and receives only the delta of records since yesterday.
Persistent Staging Area table: never truncated, loaded with the delta records from the staging table. As a result, records are never deleted, and existing records are updated based on the natural key.
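For illustration, here is a minimal sketch of that upsert step in Python with SQLite. The table and column names (stg_a, psa_a, nk, attr) are made up, and SQLite 3.24+ is assumed for the ON CONFLICT syntax; it is only meant to show the pattern, not our actual tooling:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stg_a (nk TEXT PRIMARY KEY, attr TEXT, load_ts TEXT);
    CREATE TABLE psa_a (nk TEXT PRIMARY KEY, attr TEXT, load_ts TEXT);
""")

def load_psa(con):
    # Upsert today's delta from the staging table into the PSA:
    # new natural keys are inserted, existing ones are updated in place,
    # and rows that did not arrive in the delta are left untouched.
    con.execute("""
        INSERT INTO psa_a (nk, attr, load_ts)
        SELECT nk, attr, load_ts FROM stg_a
        ON CONFLICT(nk) DO UPDATE SET
            attr    = excluded.attr,
            load_ts = excluded.load_ts
    """)

# Day 1: the initial load arrives in staging.
con.executemany("INSERT INTO stg_a VALUES (?, ?, ?)",
                [("A1", "red", "day1"), ("A2", "blue", "day1")])
load_psa(con)

# Day 2: staging is truncated and only the changed record arrives.
con.execute("DELETE FROM stg_a")
con.execute("INSERT INTO stg_a VALUES (?, ?, ?)", ("A1", "green", "day2"))
load_psa(con)

print(con.execute("SELECT * FROM psa_a ORDER BY nk").fetchall())
# [('A1', 'green', 'day2'), ('A2', 'blue', 'day1')]  -- the PSA stays complete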

All dimension and fact tables are truncated and reloaded every night, which is possible thanks to the Persistent Staging Area. No history is currently required in the dimension or fact tables. It was probably designed this way in the past because it lets you completely rebuild all the dimension and fact tables whenever you want. It also makes changes a little easier to implement, since you do not have to back up the data every time.

We are rethinking our data warehouse design, since we have learned a lot over the past years. 🙂 We have ETL performance issues, so we want to look at incrementally loading the dimension and fact tables, but we are struggling with the following issue.

Let’s say we cut out the Persistent Staging Area layer, so we only have the staging tables, which are loaded with delta records only. We have a view C that combines data from source tables A and B. This view C is the source for dimension table D and fact table F. (This is a very simplified example.)

Now a column value changes for a record in table A, and this column is an attribute of dimension table D. Since view C is based on two staging tables that are incrementally loaded, whether we see this record in view C depends on the join type. Let’s say it is a full outer join: we then see the changed column value from table A together with NULL values for all the fields from table B. These NULLs would enter dimension table D alongside the new field value from table A, which is unwanted of course, since it makes the data inconsistent. At the moment this problem is solved by the Persistent Staging Area: the record there is updated and propagates correctly to the dimension, because the dimension is reloaded in full every night. I hope I have explained it clearly.
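To illustrate, with made-up column names and plain Python dictionaries standing in for the staging tables, this is roughly what happens when only table A’s change arrives in the delta:

# Only table A's change arrived in today's delta; staging B is empty.
stg_a = {"K1": {"colour": "green"}}   # the changed record from source table A
stg_b = {}                            # source table B did not change today

# A full-outer-join style combination of the two delta-only staging tables.
view_c = {}
for key in stg_a.keys() | stg_b.keys():
    row = {"colour": None, "size": None}
    row.update(stg_a.get(key, {}))
    row.update(stg_b.get(key, {}))
    view_c[key] = row

print(view_c)
# {'K1': {'colour': 'green', 'size': None}}
# The NULL "size" would flow into dimension D, even though the source
# still has a perfectly valid value for it -- hence the inconsistency.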

So we want to look at cutting out the Persistent Staging Area layer, but we are not sure how to cope with changes like this in the scenario where we load only the changes into our staging tables and truncate them before each load. I am not sure how you would normally solve this. Is there always some sort of temporary staging required between the staging table and the dimension or fact table? Or am I missing something here?

So my question is not about the delta load of the staging tables (I know about CDC), nor about the fact that truncating and reloading our dimension and fact tables is bad practice. I am probably missing something crucial about how you bring data from staging tables that contain only delta records to dimension and fact tables that are combined from many source tables, in a consistent manner, when only one source record changes. There should be some intermediate staging to make things consistent, right?

Can the fact that a table is big impact the overall performance of a PostgreSQL server?

If I have a table that is getting bigger and bigger (i.e. it is taking more and more storage space, currently 65 GB), can this affect the overall performance of the PostgreSQL server, e.g. slow down queries on other tables?

This is for a PostgreSQL 9.6 database (we plan to upgrade to 10 -> 11 -> 12 later this year), hosted on Google Cloud (Cloud SQL for PostgreSQL).

mediawiki – “Template:Template other” is at the top of the page despite the fact that I didn’t put it there

(Screenshot of the page preview)

If it helps, I’m using Miraheze as my host of choice. As you can see above, the preview says “Template:Template other” despite the fact that the page source begins with:

{{Infobox
|image    = [[File:4004small.png]]
<!-- snip -->

The '''4004''', released in 1971, is the first microprocessor developed by Intel, and is regarded as the first...

Nowhere in the source is “Template other” mentioned; the only other template I use is {{PAGENAME}}, plus {| for a table. How can I get rid of that red text at the top?

customs and immigration – UK Visitor Visa: Is the fact that my girlfriend and I are planning to get married in her country good evidence that she will leave again?

I am a UK citizen and my long-time girlfriend is a Russian citizen. She has visited the UK once before, a year ago, on a visitor visa for a month.

In around 6 months we plan to get married in Russia and then apply for a family visa so she can stay in the UK. Until that time, I would like her to stay with me for a few months on a visitor visa (I can’t go to Russia right now due to lockdown rules, and the family visa is too expensive at the moment; I’m starting a PhD in October, which would give me the funds). Would it be sensible to be upfront in the visa application about these plans, and are the plans themselves sensible?

Thank you

optimization – How to leverage the fact that I’m solving thousands of very similar SMT instances?

I have a core SMT problem consisting of 100,000 bit-vector array clauses and one 10,000-dimensional bit-vector array. My program then takes as input k << 100,000 new clauses and adds them to the core SMT problem. My goal is, for any input of k clauses, to solve the entire resulting problem.
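To make the setup concrete, here is a minimal sketch using the z3 Python bindings. The sizes, variable names and clause shapes are made up (the real core has ~100,000 clauses and a 10,000-dimensional array), and the push/pop structure is just one obvious way of sharing the core between instances, not necessarily the kind of optimization I am asking about:

from z3 import Array, BitVecSort, BitVecVal, Select, Solver

solver = Solver()

# The shared core: one bit-vector array plus its core clauses,
# asserted once and reused for every instance.
arr = Array("arr", BitVecSort(16), BitVecSort(32))
for i in range(100):                       # stand-in for the 100,000 core clauses
    solver.add(Select(arr, BitVecVal(i, 16)) >= i)

# Each instance adds its own k extra clauses on top of the same core.
instances = [
    [Select(arr, BitVecVal(3, 16)) == 7],
    [Select(arr, BitVecVal(5, 16)) == 5,
     Select(arr, BitVecVal(6, 16)) == 6],
]

for extra in instances:
    solver.push()                          # keep the core, add this instance's clauses
    solver.add(*extra)
    print(solver.check())                  # sat/unsat for this sibling instance
    solver.pop()                           # drop only the instance-specific clauses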

Is there any static optimization or learning I could do on the core problem in order to find a better way to solve each of these sibling instances? For instance, is there some property of the graph of bit-vector variables constrained by each clause that I could use as a heuristic for solving the specific instances?

Thanks!

why do all black looters matter protestors ignore the fact that George Fraud had a criminal history and wasn’t a clean cut person?

Miranda was a scumbag too, but thanks to his case, you now know that you have the right to say, “I want to speak to my attorney” and then clam up when you’re arrested.  

Excessive force, particularly when it’s lethal, is a problem when it comes to law enforcement.  It has been a problem for a long time, while society and the government have repeatedly looked the other way.  Police brutality is a major issue and it needs to be handled.

George Floyd didn’t deserve to die like that.  It’s that simple.  You’re just looking for any reason you can to justify murder.  If you or I or any civilian had knelt on the neck of a man for nearly 9 minutes, while 2 other civilians held him down and a third stood by in order to intimidate anybody who might intervene, then there would have been arrests and murder charges filed almost immediately.  So how come civilians are held to a higher standard of restraint than supposedly trained professionals?  Let’s fix the system and make it better.

Fact or myth: Native Americans came from Asia?

Look at a Native American. Look at a Chinese person. See the slick, straight black hair? See the shape of the eyes? See the high foreheads? 

See the bronze color of their skin? There has to be a family resemblance. As for facts, there has been a ton of research, and by now I am sure DNA studies have been done showing that American Indians and Chinese or other Asians are closely related.

Even 40 years ago, there was research showing how they might have migrated from China across the Bering Strait into Alaska and, over thousands of years, filtered down to the rest of the Americas. It showed that there was a land bridge there, which no longer exists.

I think it is fascinating. I doubt that the exodus was from the Americas to China.