Introduction
In
Business Intelligence and data science there has always been this debate about
whether a star schema is better than a snowflake schema or vice versa. This
post, is not about that argument. This is a discussion that comes after the
choice has been made.
Once you have chosen whether to go star or snowflake, there is the next question which is: "Is there a case whereby lumping everything into the fact with a few dimensions is better than having a streamlined fact table with many dimensions? "
Once you have chosen whether to go star or snowflake, there is the next question which is: "Is there a case whereby lumping everything into the fact with a few dimensions is better than having a streamlined fact table with many dimensions? "
In this discussion, I will try arguing for both cases from my experience thus far. For my debate regarding star vs snowflake, you may want to look at:
Big Fact, few dimensions
Many
people can argue to put everything into the fact table. This is not necessarily
wrong per se. It takes all the hard work out of modelling correctly and making
sure the data links together properly.
There is
a case for this, the case to join everything into one gigantic, monstrous Fact
table.
For example: If the de-normalising of an ultra-normalised database makes it hard to de-normalise, the pressure is there to just lump almost everything into the Fact and have a few dimensions. There are advantages and disadvantages.
For example: If the de-normalising of an ultra-normalised database makes it hard to de-normalise, the pressure is there to just lump almost everything into the Fact and have a few dimensions. There are advantages and disadvantages.
Advantages
-
All
needed data is found in the Fact
-
It is
easier to query one table for fields instead of having many hops through
dimensions.
Disadvantages
-
Increases
the size of the Fact table
-
Reduces
performance of a dashboard
-
Reload
time increases due to the huge volumes of the Fact table
-
Un used
fields are loaded into the dashboard
-
Incorrect
measures could be calculated due to potential duplicates arising from joins
Small Fact, Many Dimensions
As mentioned above, best
practice is to have a Fact table with links/keys to dimensions and measures
that will be used on the dashboard. The Business Intelligence tool, Power BI,
does this well. Although other tools and cubes should also be built in this
manner. Building your model like this also promotes a separation of concern.
We should always keep in mind that dimensions are for describing entities. Just as we have in development, a class should describe an object so too we have in Data warehousing that dimensions describe entities and the Fact is used to link all of them together with its measures.
We should always keep in mind that dimensions are for describing entities. Just as we have in development, a class should describe an object so too we have in Data warehousing that dimensions describe entities and the Fact is used to link all of them together with its measures.
Advantages
-
The size
of the Fact table is smaller
-
Performance
of a dashboard is faster
-
Reload
time decreases due to the low volumes of the Fact table
-
Un used
fields are not loaded into the dashboard
-
Duplicates
in Dimensions should not be allowed therefore more accurate measures
Disadvantages
-
Linking to the Fact table must be precise
-
Some measures could be incorrect if linking is not
done well
Conclusion
In
hindsight, it should be stated that that it is not best practice to do this. A
Fact table should consist of several keys linking to different dimensions with
many measures. This takes the burden off the fact table.
The dimensions should be used to describe an entity and the fact should be freed of this. The role of the Fact table should be to contain keys and measures. This measure will then be used to determine KPI’s on the front-end of the dashboard.
The dimensions should be used to describe an entity and the fact should be freed of this. The role of the Fact table should be to contain keys and measures. This measure will then be used to determine KPI’s on the front-end of the dashboard.
Modelling
is 80% and creating the dashboard is only 20% of the whole work
Good article, I like that you placed emphasis on the importance of modelling (80 20). Most performance issue are as a result of bad schema design.
ReplyDelete