Tuesday, 8 March 2016

Star Schema Vs Snowflake

Introduction

 

In data warehousing, a common question is which is better: the Star Schema or the Snowflake Schema. Qlikview is no different, because the data model is one of the most important parts of the whole process. Once the back end (where the data associations are made) is well scripted, showing relevant data in the front end, where the user interacts with the data, seldom takes much more work or causes much difficulty.

In this post, I will argue which schema is the best to use in Qlikview. Using arguments from seasoned Qlikview practitioners as well as examples of my own, I will try to show the advantages and disadvantages of both the Snowflake and Star Schemas in terms of:

  1. Practicality
  2. Performance

What is a Star Schema?

 

In computing, the Star Schema is the simplest style of data mart schema. The Star Schema consists of one or more fact tables referencing any number of dimension tables. The Star Schema is an important special case of the Snowflake Schema, and is more effective for handling simpler queries.
The Star Schema gets its name from the physical model's resemblance to a star shape with a fact table at its center and the dimension tables surrounding it representing the star's points. (Wikipedia, 2015)

To summarise the above, a Star Schema is a structure in which dimensions are connected via a fact table situated in the centre. The fact table contains the foreign keys of all the dimension tables and, besides those keys, measures such as total sales, total costs, total number of clients, etc. The dimension tables hold the details that describe each dimension further.

E.g. the dimension table Advertiser_dimension will contain fields such as Advertiser_id, Advertiser_name and Advertiser_address.
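
The structure described above can be sketched in a few lines of Python using the standard sqlite3 module. This is only an illustration, not Qlikview script: Advertiser_dimension and its fields come from the example above, while Date_dimension, Sales_fact and all the sample rows are assumptions made up for the sketch.

```python
import sqlite3

# In-memory database holding a tiny, hypothetical star schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Advertiser_dimension (
    Advertiser_id      INTEGER PRIMARY KEY,
    Advertiser_name    TEXT,
    Advertiser_address TEXT
);
CREATE TABLE Date_dimension (          -- assumed second dimension
    Date_id INTEGER PRIMARY KEY,
    Year    INTEGER
);
CREATE TABLE Sales_fact (              -- assumed fact table: foreign keys plus measures
    Advertiser_id INTEGER REFERENCES Advertiser_dimension(Advertiser_id),
    Date_id       INTEGER REFERENCES Date_dimension(Date_id),
    Total_sales   REAL,
    Total_costs   REAL
);
INSERT INTO Advertiser_dimension VALUES
    (1, 'Acme', '1 Main Rd'), (2, 'Globex', '9 Long St');
INSERT INTO Date_dimension VALUES (10, 2015), (11, 2016);
INSERT INTO Sales_fact VALUES
    (1, 10, 100.0, 60.0), (1, 11, 150.0, 90.0), (2, 11, 200.0, 120.0);
""")


def sales_by_advertiser(connection):
    """Sum the sales measure per advertiser; the dimension is one join from the fact."""
    return connection.execute("""
        SELECT d.Advertiser_name, SUM(f.Total_sales)
        FROM Sales_fact AS f
        JOIN Advertiser_dimension AS d ON d.Advertiser_id = f.Advertiser_id
        GROUP BY d.Advertiser_name
        ORDER BY d.Advertiser_name
    """).fetchall()


result = sales_by_advertiser(conn)
print(result)  # [('Acme', 250.0), ('Globex', 200.0)]
```

Every dimension sits exactly one join away from the fact table, which is what gives the model its star shape.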




What is a Snowflake Schema?

 

The Snowflake Schema is an extension of the Star Schema, where each point of the star explodes into more points. In a Star Schema, each dimension is represented by a single dimensional table, whereas in a Snowflake Schema, that dimensional table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy. (Snowflake Schema, 2015)

So, unlike the Star Schema, the Snowflake Schema organises data inside the database in order to eliminate redundancy, which helps to reduce the amount of data. The hierarchy of the business and its dimensions is preserved in the data model through referential integrity.
E.g. the dimension table Account_Dimension will have its details spread across several different tables linked to it.
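
A minimal sketch of such a normalized dimension, again in Python with sqlite3, could look as follows. Only Account_Dimension is taken from the example above. The Branch and Region lookup tables, the Sales_fact table and the sample rows are hypothetical and serve only to show one level of the hierarchy per table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical lookup tables: one table per level of the hierarchy.
CREATE TABLE Region (
    Region_id   INTEGER PRIMARY KEY,
    Region_name TEXT
);
CREATE TABLE Branch (
    Branch_id   INTEGER PRIMARY KEY,
    Branch_name TEXT,
    Region_id   INTEGER REFERENCES Region(Region_id)
);
CREATE TABLE Account_Dimension (
    Account_id   INTEGER PRIMARY KEY,
    Account_name TEXT,
    Branch_id    INTEGER REFERENCES Branch(Branch_id)
);
CREATE TABLE Sales_fact (              -- assumed fact table
    Account_id  INTEGER REFERENCES Account_Dimension(Account_id),
    Total_sales REAL
);
INSERT INTO Region VALUES (1, 'North'), (2, 'South');
INSERT INTO Branch VALUES (1, 'Branch A', 1), (2, 'Branch B', 2), (3, 'Branch C', 2);
INSERT INTO Account_Dimension VALUES (1, 'A-100', 1), (2, 'A-200', 2), (3, 'A-300', 3);
INSERT INTO Sales_fact VALUES (1, 10.0), (2, 20.0), (3, 30.0), (1, 5.0);
""")

# Reaching the region name from the fact table takes three joins:
# Sales_fact -> Account_Dimension -> Branch -> Region.
sales_by_region = conn.execute("""
    SELECT r.Region_name, SUM(f.Total_sales)
    FROM Sales_fact AS f
    JOIN Account_Dimension AS a ON a.Account_id = f.Account_id
    JOIN Branch AS b ON b.Branch_id = a.Branch_id
    JOIN Region AS r ON r.Region_id = b.Region_id
    GROUP BY r.Region_name
    ORDER BY r.Region_name
""").fetchall()

print(sales_by_region)  # [('North', 15.0), ('South', 50.0)]
```

Each region name is stored once, at the cost of a longer join path from the fact table.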



Performance (Hops)

 

Many in the Qlik Community claim that minimising the number of hops in a data model significantly increases the model's performance. Although this is true to a degree, it is not the be-all and end-all of improving performance. When modelling data, one must also take the data itself into consideration and load only the relevant data. Even at this level, the data practitioner has to keep the users and their needs in mind. Without knowing the users' needs, the model could spiral into an unforgiving monster. Henric Cronström, a Qlikview developer, claims that minimising the number of hops affects performance only marginally.




Others argue that the number of hops does affect performance. For example, a model like the one above, where you need four hops from the fact table to the last dimension table, is performance heavy. Take the image below: suppose you want to show the Salesperson's office city where most of the orders originate, in terms of the Supplier and the products he supplies. As the image shows, it takes four hops to get the data you need. This is performance heavy in Qlikview because the set analysis (a way to define an aggregation scope different from the current selection) could get relatively complicated.
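
To make the idea of a hop concrete, the data model can be treated as a graph in which tables are nodes and key associations are edges. The number of hops between two tables is then the length of the shortest path between them. The sketch below is plain Python, and the table names and links are hypothetical stand-ins chosen to give a four-hop path. They are not the model from the image.

```python
from collections import deque

# Hypothetical model: each pair is an association (a shared key field) between two tables.
links = [
    ("Suppliers", "Products"),
    ("Products", "Orders"),
    ("Orders", "Salesperson"),
    ("Salesperson", "Offices"),
]


def count_hops(links, start, target):
    """Return the fewest associations to cross from start to target, or None if unreachable."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    # Breadth-first search visits tables in order of increasing distance.
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        table, hops = queue.popleft()
        if table == target:
            return hops
        for neighbour in graph.get(table, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None


print(count_hops(links, "Suppliers", "Offices"))  # 4
print(count_hops(links, "Orders", "Offices"))     # 2
```

Running such a check over a model is one way to spot the longest paths before deciding whether any tables should be merged.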

In most cases, there is room for a model with many hops. In the image below, this would have been the best way to design the data in order to make the model readable and reliable, and to avoid a cumbersome model where tables have unrelated fields in them. So it is very important to understand the data and to model it correctly. The general rule of thumb is:
  1. The 80 / 20 rule; 80% of the effort goes into the data and 20% goes into the user interface

 

Advantages

 

  • Star Schema

     

  1. Queries run faster against a Star Schema database
  2. A Star Schema structure reduces the time required to load large batches of data into a database
  3. A Star Schema is designed to enforce referential integrity of loaded data
  4. Navigating through data is efficient because dimensions are joined through fact tables
  • Snowflake Schema

     

  1. The main advantage of the Snowflake Schema is the improvement in query performance due to minimized disk storage requirements and the joining of smaller lookup tables.
  2. It is easier to maintain.
  3. It increases flexibility.

Disadvantages

 

  • Star Schema

     

  1. Data integrity is not enforced as well as it is in a highly normalized database
  2. One-off inserts and updates can result in data anomalies which normalized schemas are designed to avoid.
  3. It is also not as flexible in terms of analytical needs as a normalized data model
  • Snowflake Schema

     

  1. The main disadvantage of the Snowflake Schema is the additional maintenance effort needed because of the increased number of lookup tables.
  2. It makes queries much more difficult to create because more tables need to be joined.
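
The data anomaly listed under the Star Schema disadvantages can be illustrated with a small Python and sqlite3 sketch. All table names and rows here are hypothetical: Account_star stands for a denormalized dimension that repeats the region name on every row, while Account_snow and Region stand for the normalized equivalent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star-style (denormalized) dimension: the region name is repeated on every row.
CREATE TABLE Account_star (
    Account_id   INTEGER PRIMARY KEY,
    Account_name TEXT,
    Region_name  TEXT
);
INSERT INTO Account_star VALUES (1, 'A-100', 'North'), (2, 'A-200', 'North');

-- Snowflake-style (normalized): the region name is stored once in a lookup table.
CREATE TABLE Region (
    Region_id   INTEGER PRIMARY KEY,
    Region_name TEXT
);
CREATE TABLE Account_snow (
    Account_id   INTEGER PRIMARY KEY,
    Account_name TEXT,
    Region_id    INTEGER REFERENCES Region(Region_id)
);
INSERT INTO Region VALUES (1, 'North');
INSERT INTO Account_snow VALUES (1, 'A-100', 1), (2, 'A-200', 1);
""")

# A one-off update renames the region but only touches one of the repeated rows.
conn.execute("UPDATE Account_star SET Region_name = 'Northern' WHERE Account_id = 1")
# The same rename in the normalized model changes a single lookup row.
conn.execute("UPDATE Region SET Region_name = 'Northern' WHERE Region_id = 1")

star_regions = [row[0] for row in conn.execute(
    "SELECT DISTINCT Region_name FROM Account_star ORDER BY Region_name")]
snow_regions = [row[0] for row in conn.execute(
    """SELECT DISTINCT r.Region_name
       FROM Account_snow AS a
       JOIN Region AS r ON r.Region_id = a.Region_id
       ORDER BY r.Region_name""")]

print(star_regions)  # ['North', 'Northern']
print(snow_regions)  # ['Northern']
```

The denormalized table now holds two spellings of what should be one region, while the normalized model stays consistent because the name lives in only one place.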

Conclusion

 

In conclusion, both schemas have merit. In Qlikview, the preferred schema is the Star Schema, as it provides us with queries that run faster. The structure also allows us to run large batches of data faster and load data quickly. Users typically do not want to wait for data to display, and this is important in Qlikview because it allows business heads to make important, crucial decisions based on the data. If data takes forever to load, business users will have issues using the tool.
However, this does not completely rule out the use of a Snowflake structure in Qlikview.

There is also room for a Snowflake structure, depending on the data itself as well as how you want to represent the data. It allows the structure to be easily maintained, as dimensions and facts are not bombarded with many fields. This leads to a better understanding of the data you are working on.
In the end, choosing an appropriate structure will depend solely on the data and the needs of the business.

References

 

  1. Snowflake Schema. 2015. [ONLINE] Available at: http://www.1keydata.com/datawarehousing/snowflake-schema.html. [Accessed 07 October 2015].
  2. [No title]. 2015. [ONLINE] Available at: http://cdn.ttgtmedia.com/rms/SearchBusinessIntelligence_IN/Star-vs-snowflake-image-two.png. [Accessed 07 October 2015].
  3. Star schema - Wikipedia, the free encyclopedia. 2015. [ONLINE] Available at: https://en.wikipedia.org/wiki/Star_schema. [Accessed 07 October 2015].
  4. Qlik Design Blog : A Myth about the Number of Hops | Qlik Community. 2015. [ONLINE] Available at: https://community.qlik.com/blogs/qlikviewdesignblog/2015/01/19/number-of-hops. [Accessed 08 October 2015].
  5. Performance w/ multiple fact tables - star sche... | Qlik Community. 2015. [ONLINE] Available at: https://community.qlik.com/message/360786#360786. [Accessed 10 October 2015].
  6. [No title]. 2016. [ONLINE] Available at: http://source.entelect.co.za/to-star-or-to-snowflake-in-qlikview. [Accessed 11 March 2016].