Tuesday 30 May 2017

One Table To Rule Them All


Introduction


In Business Intelligence and data science there has always been a debate about whether a star schema is better than a snowflake schema or vice versa. This post is not about that argument. This is a discussion that comes after the choice has been made.

Once you have chosen whether to go star or snowflake, the next question is: is there a case where lumping everything into the fact table with a few dimensions is better than having a streamlined fact table with many dimensions?

In this discussion, I will try to argue both cases from my experience thus far. For my debate regarding star vs snowflake, you may want to look at: https://analyticsandeverythingelse.blogspot.co.za/2016/03/star-schema-vs-snowflake.html


Big Fact, Few Dimensions


Many people argue for putting everything into the fact table, and that is not necessarily wrong per se. It takes out the hard work of modelling correctly and making sure the data links together properly.
There is a case for this: the case for joining everything into one gigantic, monstrous Fact table.

For example, if an ultra-normalised source database is hard to de-normalise cleanly, the pressure is there to just lump almost everything into the Fact table and keep only a few dimensions. There are advantages and disadvantages, and a sketch of such a load follows the lists below.

Advantages

-        All needed data is found in the Fact
-        It is easier to query one table for fields instead of having many hops through dimensions.
Disadvantages

-        Increases the size of the Fact table
-        Reduces performance of a dashboard
-        Reload time increases due to the huge volumes of the Fact table
-        Unused fields are loaded into the dashboard
-        Incorrect measures could be calculated due to potential duplicates arising from joins
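As a rough illustration, here is a minimal QlikView-style load script sketch of the big-fact approach (all table, file and field names are invented for illustration): every descriptive attribute is joined straight onto the fact, leaving only a couple of small dimensions.

// Hypothetical "one big fact" load: descriptive attributes are joined
// straight onto the fact instead of living in their own dimensions.
Fact:
LOAD OrderID,
     CustomerID,
     ProductID,
     OrderDate,
     Quantity,
     UnitPrice,
     Quantity * UnitPrice AS LineAmount
FROM Orders.qvd (qvd);

LEFT JOIN (Fact)
LOAD CustomerID,
     CustomerName,
     CustomerCity,
     CustomerSegment
FROM Customers.qvd (qvd);

LEFT JOIN (Fact)
LOAD ProductID,
     ProductName,
     ProductCategory
FROM Products.qvd (qvd);

// Caution: if Customers.qvd held more than one row per CustomerID, the join
// above would duplicate fact rows and inflate any sum over LineAmount,
// which is exactly the duplicate-measure risk listed above.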

Small Fact, Many Dimensions

 

As mentioned above, best practice is to have a Fact table with keys linking to the dimensions plus the measures that will be used on the dashboard. Power BI encourages this structure, and models and cubes built in other tools should follow the same pattern. Building your model like this also promotes a separation of concerns.

We should always keep in mind that dimensions are for describing entities. Just as a class describes an object in software development, so in data warehousing the dimensions describe entities while the Fact table links them all together and carries the measures. A sketch of such a star schema follows the lists below.

Advantages

-        The size of the Fact table is smaller
-        Performance of a dashboard is faster
-        Reload time decreases due to the low volumes of the Fact table
-        Unused fields are not loaded into the dashboard
-        Duplicates are not allowed in dimensions, so measures are more accurate

Disadvantages

-        Linking to the Fact table must be precise
-        Some measures could be incorrect if linking is not done well
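For contrast, here is a minimal sketch of the small-fact approach using the same hypothetical tables: the fact keeps only keys and measures, and each dimension describes a single entity. In QlikView the tables associate automatically on the shared key field names.

// Lean fact: keys and measures only.
Fact:
LOAD OrderID,
     CustomerID,                        // key to the Customer dimension
     ProductID,                         // key to the Product dimension
     Quantity,
     UnitPrice,
     Quantity * UnitPrice AS LineAmount
FROM Orders.qvd (qvd);

// Each dimension describes one entity; DISTINCT guards against repeated source rows.
Customer:
LOAD DISTINCT CustomerID,
     CustomerName,
     CustomerCity,
     CustomerSegment
FROM Customers.qvd (qvd);

Product:
LOAD DISTINCT ProductID,
     ProductName,
     ProductCategory
FROM Products.qvd (qvd);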

Conclusion

 

In hindsight, it should be stated that lumping everything into the Fact table is not best practice. A Fact table should consist of several keys linking to different dimensions, along with its measures. This takes the burden off the Fact table.
 

The dimensions should be used to describe entities and the Fact table should be freed of this. The role of the Fact table is to contain keys and measures. These measures are then used to determine KPIs on the front end of the dashboard.

Modelling is 80% of the whole work; creating the dashboard is only 20%.

Friday 24 February 2017

Improving Performance In Qlikview Apps

Introduction

 

In all areas of development, performance is an important consideration and a good efficiency measure that should be taken seriously. Some might say you can just throw hardware at any performance issue and that will fix it, but sadly that is not always the case. QlikView is a memory-intensive Business Intelligence tool and, granted, throwing more memory at it would help, but there are still many considerations to think about: budgets can be an issue in some organisations, so it is important to streamline the model as much as possible.
As the data grows you may find a need to optimise your model, but in principle performance should be at the top of your agenda during development anyway.

Method 

 

To effectively find all the areas that need performance improvement, we need a way to see the metadata of the model: all the information such as tables, fields and expressions, mapped to the total memory they use. Rob Wunderlich has written a document analyser that allows us to do this.


Steps:
1.     Go to http://qlikviewcookbook.com/tools/#squelch-taas-accordion-shortcode-content-3 and download the QV Document Analyser.
2.     Store it anywhere on your disk, but preferably where the model is stored.
3.     Create a copy, so that one file holds the metadata before tuning and the other holds the metadata after the performance changes.
4.     Open the analyser and point it to the model you want to improve.
a.     In our case, we will analyse the Unit Testing model I built before.

Results

 

The data used for this test comes from the MySQL localhost sample data. There are roughly 4,000 rows in the largest table. Although the data is not substantial, it is enough to show the concepts we are trying to get across in this post; a larger data model would show the effects even more clearly.


Before Tuning


1.     The analysis took 5 seconds
2.     It took 1.3 seconds to open the document
3.     28 fields were used. (In Client models, you’d find a fair number of fields being loaded but never used)
4.     It takes 141 milliseconds to open an object
5.     Uses 1.07MB of RAM

 


 

As we can see from the recommendations, there are a few things we can do to improve this model, namely to autonumber two of the keys in the model. In a bigger model there will be many more recommendations, each showing the amount of memory you could save by implementing the change.

 

 

 The memory used before the recommendations were implemented stood at: 1,118,049 bytes.

 

After Tuning

1.     The analysis took 5 seconds
2.     It took 1.3 seconds to open the document
3.     28 fields were used. (In Client models, you’d find a fair number of fields being loaded but never used)
4.     It takes 125 milliseconds to open an object
5.     Uses 1.07MB of RAM

 
 

The memory used after the recommendations were implemented now stands at: 1,117,571 bytes  



  

Analysis

 

The results above show the power of using the document analyser to improve your model, but it is not the only way to improve performance. There are a number of other things one can look at.


Remove Synthetic Keys

Synthetic keys are formed when two or more tables share more than one common field. To solve this, we can rename the fields concerned, join the tables, or simply not load the duplicate fields, depending on the needs of the model.
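A small illustrative sketch of the renaming approach (table, file and field names invented): if two tables share both OrderID and CustomerID, QlikView builds a synthetic key over the pair; renaming, or simply not loading, one of the shared fields leaves a single deliberate link.

Orders:
LOAD OrderID,
     CustomerID,
     OrderDate
FROM Orders.qvd (qvd);

Shipments:
LOAD OrderID,                             // the one intentional link
     CustomerID AS ShipmentCustomerID,    // renamed to avoid a synthetic key
     ShipmentDate
FROM Shipments.qvd (qvd);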


Use of Autonumber

"This script function returns a unique integer value for each distinct evaluated value of expression encountered during the script execution." (https://help.qlik.com/en-US/qlikview/12.1/Subsystems/Client/Content/Scripting/CounterFunctions/autonumber.htm)
Using this function, we can create keys based on integers instead of strings or other data types that would use more system resources.
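A minimal sketch, assuming a composite key built from two illustrative fields: applying autonumber() on both sides stores the link as small integers instead of long concatenated strings. Note that the generated integers are only consistent within a single script run, so both tables must be autonumbered in the same reload.

Sales:
LOAD autonumber(OrderID & '|' & OrderLineNo) AS %SalesKey,   // integer surrogate key
     Quantity,
     UnitPrice
FROM Sales.qvd (qvd);

OrderLines:
LOAD autonumber(OrderID & '|' & OrderLineNo) AS %SalesKey,   // same expression, same integers
     ProductID,
     Discount
FROM OrderLines.qvd (qvd);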


Model Towards a Star Schema

Although in some cases the structure of the data calls for a snowflake schema, having a star schema greatly improves the performance of the model. This comes down to the number of hops that QlikView has to make between tables. Please see https://analyticsandeverythingelse.blogspot.co.za/2016/03/star-schema-vs-snowflake.html regarding hops.
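One common way to flatten a snowflake into a star, sketched here with hypothetical tables, is to join the outer lookup table into its parent dimension so the front end has one less hop to make.

// Snowflake: Category hangs off Product, one extra hop from the fact.
Product:
LOAD ProductID,
     ProductName,
     CategoryID
FROM Products.qvd (qvd);

// Flatten: pull the category attributes into the Product dimension
// so the fact links to a single, slightly wider dimension instead.
LEFT JOIN (Product)
LOAD CategoryID,
     CategoryName
FROM Categories.qvd (qvd);

// The intermediate key is no longer needed on the front end.
DROP FIELD CategoryID;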


Using Simple Expressions

It is worth noting that all complicated expressions should be done in the back-end script, preferably in a Tier 2 model where all the heavy lifting is done. The model will perform much better with, for example, an expression such as sum({<isCalculated = {'Yes'}>} Sales) instead of repeating the whole calculation behind isCalculated in every chart.
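A minimal sketch of what that looks like in practice (the condition and field names are invented): the flag is computed once in the back-end script, so every chart can use the cheap set-analysis expression instead of re-evaluating the condition row by row.

// Back end (Tier 2): evaluate the heavy condition once, during the reload.
Transactions:
LOAD *,
     If(OrderStatus = 'Invoiced' and Quantity > 0, 'Yes', 'No') AS isCalculated
FROM Transactions.qvd (qvd);

// Front end: the chart expression stays simple:
//   sum({<isCalculated = {'Yes'}>} Sales)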


Dropping Temp tables and unused fields

By dropping unused fields and temporary tables, we remove them from memory and make the model more efficient. If we load and keep them, they just reside in memory without contributing anything to the model. Left unattended, they can also clutter the model and you could end up with a slow one.
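A short sketch (table and field names invented) of tidying up at the end of the script:

// Temporary staging table used only to shape the final fact.
TempSales:
LOAD OrderID,
     CustomerID,
     Quantity,
     UnitPrice,
     InternalComment               // a field the dashboard never uses
FROM Sales.qvd (qvd);

Fact:
LOAD OrderID,
     CustomerID,
     Quantity * UnitPrice AS LineAmount,
     InternalComment
RESIDENT TempSales;

// Free the memory once the fact has been built.
DROP TABLE TempSales;
DROP FIELD InternalComment FROM Fact;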


Circular Loops

It is undesirable to have multiple common keys across multiple tables in a QlikView data structure, as this may cause QlikView to create circular references to generate the connections in the data structure. Circular references are generally resource heavy, may slow down calculations and, in extreme cases, can overload the application. To avoid this, we can:
  • Comment the fields in the load script
  • Rename the fields in the load script
  • Rename the fields using the QUALIFY / UNQUALIFY statements
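A brief sketch of the renaming fix (tables and fields invented): Orders, Customers and Regions all carry RegionID, giving two paths between Orders and Regions; renaming the copy on Orders leaves a single path through Customers.

Orders:
LOAD OrderID,
     CustomerID,
     RegionID AS OrderRegionID     // renamed so only one path to Regions remains
FROM Orders.qvd (qvd);

Customers:
LOAD CustomerID,
     CustomerName,
     RegionID
FROM Customers.qvd (qvd);

Regions:
LOAD RegionID,
     RegionName
FROM Regions.qvd (qvd);

// Alternatively, QUALIFY *; followed by UNQUALIFY on the intended key fields
// prefixes every other field with its table name, achieving the same effect.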

 

Conclusion

 

Performance is very important. In the case of QlikView this is especially true, as QlikView relies heavily on memory. It is crucial that we do not build models that would otherwise bring the machine to its knees. Having a lean, memory-efficient model could be the difference between the business making crucial decisions on time and being delayed by the performance of an ill-designed model.