Tuesday, 30 May 2017

One Table To Rule Them All


Introduction


In Business Intelligence and data science there has always been this debate about whether a star schema is better than a snowflake schema or vice versa. This post, is not about that argument. This is a discussion that comes after the choice has been made. 

Once you have chosen whether to go star or snowflake, there is the next question which is: "Is there a case whereby lumping everything into the fact with a few dimensions is better than having a streamlined fact table with many dimensions? "

In this discussion, I will try arguing for both cases from my experience thus far. For my debate regarding star vs snowflake, you may want to look at:


Big Fact, few dimensions


Many people can argue to put everything into the fact table. This is not necessarily wrong per se. It takes all the hard work out of modelling correctly and making sure the data links together properly.
There is a case for this, the case to join everything into one gigantic, monstrous Fact table. 

For example: If the de-normalising of an ultra-normalised database makes it hard to de-normalise, the pressure is there to just lump almost everything into the Fact and have a few dimensions. There are advantages and disadvantages.

Advantages

-        All needed data is found in the Fact
-        It is easier to query one table for fields instead of having many hops through dimensions.
Disadvantages

-        Increases the size of the Fact table
-        Reduces performance of a dashboard
-        Reload time increases due to the huge volumes of the Fact table
-        Un used fields are loaded into the dashboard
-        Incorrect measures could be calculated due to potential duplicates arising from joins

Small Fact, Many Dimensions

 

As mentioned above, best practice is to have a Fact table with links/keys to dimensions and measures that will be used on the dashboard. The Business Intelligence tool, Power BI, does this well. Although other tools and cubes should also be built in this manner. Building your model like this also promotes a separation of concern. 

We should always keep in mind that dimensions are for describing entities. Just as we have in development, a class should describe an object so too we have in Data warehousing that dimensions describe entities and the Fact is used to link all of them together with its measures.

Advantages

-        The size of the Fact table is smaller
-        Performance of a dashboard is faster
-        Reload time decreases due to the low volumes of the Fact table
-        Un used fields are not loaded into the dashboard
-        Duplicates in Dimensions should not be allowed therefore more accurate measures

Disadvantages

-        Linking to the Fact table must be precise
-        Some measures could be incorrect if linking is not done well

Conclusion

 

In hindsight, it should be stated that that it is not best practice to do this. A Fact table should consist of several keys linking to different dimensions with many measures. This takes the burden off the fact table.
 

The dimensions should be used to describe an entity and the fact should be freed of this. The role of the Fact table should be to contain keys and measures. This measure will then be used to determine KPI’s on the front-end of the dashboard.

Modelling is 80% and creating the dashboard is only 20% of the whole work

1 comment:

  1. Good article, I like that you placed emphasis on the importance of modelling (80 20). Most performance issue are as a result of bad schema design.

    ReplyDelete