
How to reduce your Power BI model size by 90%!

source link: https://towardsdatascience.com/how-to-reduce-your-power-bi-model-size-by-90-76d7c4377f2d


Have you ever wondered what makes Power BI so fast and powerful when it comes to performance? Learn, through a real-life example, about data model optimization and the general rules for reducing data model size

[Image source: https://www.pexels.com/photo/aperture-blur-downtown-focus-339379/]

Have you ever wondered what makes Power BI so fast and powerful when it comes to performance? So powerful that it performs complex calculations over millions of rows in the blink of an eye.

In this article, we will dig deep to discover what is “under the hood” of Power BI, how your data is being stored, compressed, queried, and finally, brought back to your report. Once you finish reading, I hope that you will get a better understanding of the hard work happening in the background and appreciate the importance of creating an optimal data model in order to get maximum performance from the Power BI engine.

First look under the hood — Formula Engine & Storage Engine

First, I want you to meet the VertiPaq engine, the “brain & muscles” of the system behind not only Power BI, but also SSAS Tabular and Excel Power Pivot. Truth be told, VertiPaq represents only one part of the storage engine within the Tabular model, besides DirectQuery, which we will talk about in one of the next articles.

When you send the query to get data for your Power BI report, here is what happens:

  • The Formula Engine (FE) accepts the request, processes it, generates the query plan, and finally executes it
  • The Storage Engine (SE) pulls the data out of the Tabular model to satisfy the request issued by the query the Formula Engine generated

The Storage Engine works in two different ways to retrieve the requested data. VertiPaq keeps a snapshot of the data in-memory; this snapshot can be refreshed from time to time from the original data source.

DirectQuery, in contrast, doesn’t store any data. It forwards the query straight to the data source for every single request.

[Image source: https://unsplash.com/photos/2EHRBs1gefY]

Data in the Tabular model is usually stored either as an in-memory snapshot (VertiPaq) or in DirectQuery mode. However, there is also a possibility of implementing a hybrid Composite model, which relies on both architectures in parallel.

Formula Engine — “Brain” of Power BI

As I already stressed, the Formula Engine accepts the query, and since it “understands” DAX (and MDX too, but that is out of the scope of this series), it “translates” the DAX into a specific query plan, consisting of physical operations that need to be executed in order to get results back.

Those physical operations can be joins between multiple tables, filtering, or aggregations. It’s important to know that Formula Engine works in a single-threaded way, which means that requests to Storage Engine are always being sent sequentially.

Storage Engine — “Muscles” of Power BI

Once the query plan has been generated by the Formula Engine, the Storage Engine comes into play. It physically goes through the data stored within the Tabular model (VertiPaq) or goes directly to a different data source (SQL Server, for example, if DirectQuery storage mode is in place).

When it comes to specifying the storage mode for a table, there are three possible options to choose from:

  • Import mode — based on VertiPaq. Table data is stored in memory as a snapshot and can be refreshed periodically
  • DirectQuery mode — data is retrieved from the data source at query time. Data resides in its original source before, during, and after query execution
  • Dual mode — a combination of the first two options. Data from the table is loaded into memory, but at query time it can also be retrieved directly from the source

As opposed to the Formula Engine, which doesn’t support parallelism, the Storage Engine can work in a multi-threaded way.

Meet VertiPaq Storage Engine

Now that we have drawn the big picture, let me explain in more detail what VertiPaq does in the background to boost the performance of our Power BI reports.

When we choose Import mode for our Power BI tables, VertiPaq performs the following actions:

  • Reads the data source, transforms data into a columnar structure, encodes and compresses data within each of the columns
  • Establishes a dictionary and index for each of the columns
  • Prepares and establishes relationships
  • Computes all calculated columns and calculated tables and compresses them

Two main characteristics of VertiPaq are:

  1. VertiPaq is a columnar database
  2. VertiPaq is an in-memory database
[Image by author]

As you can see in the illustration above, columnar databases store and compress data differently from traditional row-store databases. Columnar databases are optimized for vertical data scanning, which means that every column is structured in its own way and physically separated from the other columns!

Without going into a deep analysis of the advantages and drawbacks of row-store vs. column-store databases, which would require a separate series of articles, let me just pinpoint a few key differences in terms of performance.

With columnar databases, single-column access is fast and effective. Once a computation starts to involve multiple columns, things become more complex, as intermediate results need to be temporarily stored in some way.

Simply put, columnar databases are more CPU-intensive, while row-store databases increase I/O, because they scan a lot of data they don’t need.

So far, we’ve painted the big picture of the architecture that enables Power BI to fully shine as an ultimate BI tool. Now we are ready to dive deeper into specific architectural solutions and, consequently, leverage this knowledge to make the most of our Power BI reports by tuning our data model to extract the maximum from the underlying engine.

Inside VertiPaq in Power BI — Compress for success!

[Image source: www.pexels.com/photo/crop-man-crushing-plastic-bottle-4498091/]

As you might recall from the previous part of this article, we scratched the surface of VertiPaq, the powerful storage engine “responsible” for the blazing-fast performance of most of your Power BI reports (whenever you use Import mode or a Composite model).

3, 2, 1…Fasten your seatbelts!

One of the key characteristics of VertiPaq is that it’s a columnar database. We learned that columnar databases store data optimized for vertical scanning, which means that every column has its own structure and is physically separated from the other columns.

That fact enables VertiPaq to apply different types of compression to each of the columns independently, choosing the optimal compression algorithm based on the values in that specific column.

Compression is achieved by encoding the values within the column. Before we dive deeper into a detailed overview of encoding techniques, just keep in mind that this architecture is not exclusive to Power BI — in the background is a Tabular model, which is also “under the hood” of SSAS Tabular and Excel Power Pivot.

Value Encoding

This is the most desirable encoding type, since it works exclusively with integers and therefore requires less memory than, for example, working with text values.

How does this look in reality? Let’s say we have a column containing the number of phone calls per day, and the values in this column range from 4,000 to 5,000. What VertiPaq does is take the minimum value in this range (4,000) as a starting point, calculate the difference between this value and every other value in the column, and store that difference as the new value.

[Image by author]

At first glance, 3 bits per value might not look like a significant saving, but multiply this by millions or even billions of rows and you will appreciate the amount of memory saved.

As I already stressed, Value Encoding is applied exclusively to integer data type columns (the currency data type is also stored as an integer).
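
To make the idea concrete, here is a toy T-SQL sketch of the same arithmetic. It is not what VertiPaq executes internally; the table and column names (dbo.DailyPhoneCalls, callCount) are invented purely for illustration:

-- Value encoding idea: store each value as an offset from the column minimum,
-- so the 4,000-5,000 range shrinks to 0-1,000 and fits into fewer bits per value
SELECT callCount
      ,callCount - MIN(callCount) OVER () AS encodedValue
FROM dbo.DailyPhoneCalls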

Hash Encoding (Dictionary Encoding)

This is probably the compression type most often used by VertiPaq. Using Hash encoding, VertiPaq creates a dictionary of the distinct values within a column and afterward replaces the “real” values with index values from the dictionary.

Here is an example to make things clearer:

[Image by author]

As you may notice, VertiPaq identified distinct values within the Subjects column, built a dictionary by assigning indexes to those values, and finally stored index values as pointers to “real” values. I assume you are aware that integer values require way less memory space than text, so that’s the logic behind this type of data compression.
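
To make the dictionary idea concrete, here is a toy T-SQL sketch that mimics the illustration above. This is not how VertiPaq works internally; it only reuses the chatID and subject columns of the factChat table that appears later in this article:

-- Toy dictionary (hash) encoding: keep one row per distinct subject with an
-- integer index, then store the index instead of the repeated text value
WITH SubjectDictionary AS (
    SELECT subject
          ,ROW_NUMBER() OVER (ORDER BY subject) - 1 AS subjectIndex
    FROM (SELECT DISTINCT subject FROM factChat) AS d
)
SELECT f.chatID
      ,sd.subjectIndex      -- integer pointer to the "real" text value
FROM factChat AS f
INNER JOIN SubjectDictionary AS sd
    ON sd.subject = f.subject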

Additionally, by being able to build a dictionary for any data type, VertiPaq is practically data type independent!

This brings us to another key takeaway: no matter whether your column is of text, bigint, or float data type, from VertiPaq’s perspective it’s all the same — it needs to create a dictionary for each of those columns, which implies that all these columns will provide the same performance, both in terms of speed and of memory space allocated. This assumes, of course, that there are no big differences in dictionary size between these columns.

So, it’s a myth that the data type of a column affects its size within the data model. On the contrary, it is the number of distinct values within the column, known as cardinality, that mostly influences a column’s memory consumption.

RLE (Run-Length-Encoding)

The third algorithm (RLE) creates a kind of mapping table containing ranges of repeating values, which avoids storing every single (repeated) value separately.

Again, taking a look at an example will help to better understand this concept:

[Image by author]

In real life, VertiPaq doesn’t store Start values, because it can quickly calculate where the next node begins by summing previous Count values.

As powerful as it might look at first glance, the RLE algorithm is highly dependent on the ordering within the column. If the data is stored the way you see in the example above, RLE will perform great. However, if your data buckets are smaller and rotate more frequently, then RLE would not be an optimal solution.

One more thing to keep in mind regarding RLE: in reality, VertiPaq doesn’t store data exactly the way it is shown in the illustration above. First, it performs Hash encoding, creating a dictionary of the subjects, and then applies the RLE algorithm, so the final logic, in its most simplified form, would be something like this:

[Image by author]

So, RLE occurs after Value or Hash Encoding, in those scenarios where VertiPaq “thinks” it makes sense to compress data further (i.e., when the data is ordered in such a way that RLE would achieve better compression).
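
For the curious, the run lengths from the illustration can be reproduced with a classic gaps-and-islands query. This is only a conceptual sketch built on the factChat table used later in this article (it assumes the physical order of the rows follows chatID), not how VertiPaq actually builds its segments:

-- Run-length encoding idea: collapse consecutive repetitions of the same
-- subject into a single row holding the value and the length of the run
WITH Ordered AS (
    SELECT subject
          ,ROW_NUMBER() OVER (ORDER BY chatID) AS rn
          ,ROW_NUMBER() OVER (PARTITION BY subject ORDER BY chatID) AS rnPerSubject
    FROM factChat
)
SELECT subject
      ,COUNT(*) AS runLength         -- the "Count" column from the illustration
FROM Ordered
GROUP BY subject, rn - rnPerSubject  -- each island of consecutive equal values
ORDER BY MIN(rn)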

Re-Encoding considerations

No matter how “smart” VertiPaq is, it can also make bad decisions based on incorrect assumptions. Before I explain how re-encoding works, let me briefly walk through the process of data compression for a specific column:

  • VertiPaq scans a sample of rows from the column
  • If the column data type is not an integer, it looks no further and uses Hash encoding
  • If the column is of integer data type, some additional parameters are evaluated: if the numbers in the sample increase linearly, VertiPaq assumes that it is probably a primary key and chooses Value encoding
  • If the numbers in the column are reasonably close to each other (the number range is not very wide, as in our example above with 4,000–5,000 phone calls per day), VertiPaq will use Value encoding. On the contrary, when values fluctuate significantly within the range (for example, between 1,000 and 1,000,000), Value encoding doesn’t make sense and VertiPaq will apply the Hash algorithm

However, it can happen that VertiPaq makes a decision about which algorithm to use based on the sample data, but then an outlier pops up and it needs to re-encode the column from scratch.

Let’s use our previous example of the number of phone calls: VertiPaq scans the sample and chooses to apply Value encoding. Then, after processing 10 million rows, it suddenly finds a value of 500,000 (it could be an error, or whatever). Now VertiPaq re-evaluates its choice, and it may decide to re-encode the column using the Hash algorithm instead. Surely, that would impact the whole process in terms of the time needed for reprocessing.

Finally, here is the list of parameters (in order of importance) that VertiPaq considers when choosing which algorithm to use:

  • Number of distinct values in the column (Cardinality)
  • Data distribution in the column — a column with many repeating values can be compressed better than one containing frequently changing values (RLE can be applied)
  • Number of rows in the table
  • Column data type — impacts only dictionary size

Reducing the data model size by 90% — real story!

Now that we’ve laid the theoretical groundwork for understanding the architecture behind the VertiPaq storage engine, and the types of compression it uses to optimize your Power BI data model, it’s the right moment to get our hands dirty and apply our knowledge to a real-life case!

Starting point = 777 MB

Our data model is quite simple, yet memory exhaustive. We have a fact table (factChat), which contains data about live support chats, and one dimension table (dimProduct), which relates to the fact table. Our fact table has around 9 million rows, which should not be a big deal for Power BI, but the table was imported as-is, without any additional optimization or transformation.

[Image by author]

Now, this pbix file consumes a whopping 777 MB!!! You can’t believe it? Just take a look:

[Image by author]

Just remember this picture! Of course, I don’t need to tell you how long this report takes to load or refresh, or how slow our calculations are, all because of the file size.

…and it’s even worse!

Additionally, it’s not just the 777 MB that takes up our memory, since total memory consumption is calculated taking the following factors into account:

  • PBIX file
  • Dictionary (you’ve learned about the dictionary in this article)
  • Column hierarchies
  • User-defined hierarchies
  • Relationships

Now, if I open Task Manager, go to the Details tab, and find the msmdsrv.exe process, I will see that it burns more than 1 GB of memory!

[Image by author]

Oh, man, that really hurts! And we haven’t even interacted with the report! So, let’s see what we can do to optimize our model…

Rule #1 — Import only those columns you really need

The first and the most important rule is: keep in your data model only those columns you really need for the report!

That being said, do I really need both the chatID column, which is a surrogate key, and the sourceID column, which is the primary key from the source system? Both of these values are unique, so even if I need to count the total number of chats, I would still be fine with only one of them.

[Image by author]

Let me check how the file looks now:

[Image by author]

By removing just one unnecessary column, we saved more than 100 MB!!! Let’s examine what else can be removed without taking a deeper look (we will come back to this later, I promise).

Do we really need both the original start time of the chat and the UTC time, one stored as a Date/Time/Timezone type and the other as Date/Time, both down to the second?!

Let me get rid of the original start time column and keep only UTC values.

[Image by author]

Another 100 MB of wasted space gone! By removing just two columns we don’t need, we reduced the size of our file by 30%!

Now, that was without even looking at the details of memory consumption. Let’s now open DAX Studio, my favorite tool for troubleshooting Power BI reports. As I’ve already stressed a few times, this tool is a MUST if you plan to work seriously with Power BI — and it’s completely free!

One of the features in DAX Studio is a VertiPaq Analyzer, a very useful tool built by Marco Russo and Alberto Ferrari from sqlbi.com. When I connect to my pbix file with DAX Studio, here are the numbers related to my data model size:

[Image by author]

Here I can see the most expensive columns in my data model and decide whether I can discard some of them or need to keep them all.

At first glance, I have a few candidates for removal — the sessionReferrer and referrer columns have high cardinality and therefore can’t be optimally compressed. Moreover, as these are text columns that need to be encoded using a Hash algorithm, you can see that their dictionary size is extremely high! If you take a closer look, you can notice that these two columns take up almost 40% of my table size!

[Image by author]

After checking with my report users whether they need any of these columns, or maybe only one of them, I got confirmation that they don’t perform any analysis on them. So, why on Earth should we bloat our data model with them?!

Another strong candidate for removal is the LastEditDate column. This column just shows the date and time when the record was last edited in the data warehouse. Again, I checked with the report users; they didn’t even know this column existed!

I removed these three columns and the result is:

[Image by author]

Oh, god, we halved the size of our data model by just removing a few unnecessary columns.

Truth be told, there are a few more columns that could be dismissed from the data model, but let’s now focus on other techniques for data model optimization.

Rule #2 — Reduce the column cardinality!

As you may recall from earlier in this article, the rule of thumb is: the higher the cardinality of a column, the harder it is for VertiPaq to optimally compress the data, especially if we are not working with integer values.

Let’s take a deeper look into VertiPaq Analyzer results:

[Image by author]

As you can see, even though the chatID column has higher cardinality than the datetmStartUTC column, it takes almost 8 times less memory! Since it is a surrogate-key integer column, VertiPaq applies Value encoding and the size of the dictionary is irrelevant. On the other hand, Hash encoding is applied to the date/time column with high cardinality, so the dictionary size is enormously larger.

There are multiple techniques for reducing the column cardinality, such as splitting columns. Here are a few examples of using this technique.

For integer columns, you can split them into two evenly distributed columns using division and modulo operations. In our case, it would be:

SELECT chatID / 1000 AS chatID_div
      ,chatID % 1000 AS chatID_mod
      -- ...remaining columns omitted
FROM factChat

This optimization technique must be performed on the source side (in this case, by writing a T-SQL statement). If we use calculated columns instead, there is no benefit at all, since the original column has to be stored in the data model first.
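
If the original key is ever needed again (for drill-through or auditing, for example), it can always be rebuilt from the two parts. A hedged sketch, reusing the illustrative column names from the snippet above and a hypothetical table holding the split columns:

-- Rebuild the original chatID from the split columns; the constant 1000 must
-- match the divisor used when the column was split
SELECT chatID_div * 1000 + chatID_mod AS chatID
FROM factChatSplit   -- hypothetical table containing the two split columns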

A similar technique can bring significant savings when you have decimal values in a column: you can simply split the values before and after the decimal point, as explained in this article.
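
A minimal sketch of that idea, assuming a hypothetical price column with two decimal places (both the table and the column are invented for illustration):

-- Split a decimal into an integer part and a fractional part stored as a whole
-- number, so both columns can benefit from Value encoding; measures then have
-- to recombine them as price_int + price_frac / 100.0
SELECT CAST(price AS int) AS price_int
      ,CAST((price - CAST(price AS int)) * 100 AS int) AS price_frac
FROM dbo.Sales   -- hypothetical table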

Since we don’t have any decimal values, let’s focus on our problem: optimizing the datetmStartUTC column. There are multiple valid options for optimizing this column. The first is to check whether your users need granularity finer than the day level (in other words, whether you can remove hours, minutes, and seconds from your data).
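
If day-level grain turns out to be enough, the fix on the source side is a simple cast. A sketch against the same factChat table (the remaining columns are omitted here):

-- Truncating the timestamp to the day drastically reduces the column cardinality
SELECT chatID
      ,CAST(datetmStartUTC AS date) AS dateStartUTC
      -- ,...remaining columns
FROM factChat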

Let’s check what savings would this solution bring:

[Image by author]

The first thing we notice is that our file is now 255 MB, about a third of what we started with. VertiPaq Analyzer’s results show that this column is now almost perfectly optimized, going from taking over 62% of our data model to just slightly over 2.5%! That’s huuuuge!

[Image by author]

However, it turned out that day-level grain was not fine enough and my users needed to analyze figures at the hour level. OK, so we can at least get rid of minutes and seconds, which will also decrease the cardinality of the column.

So, I’ve imported values rounded per hour:

SELECT chatID
,dateadd(hour, datediff(hour, 0, datetmStartUTC), 0) AS datetmStartUTC
,customerID
,userID
,ipAddressID
,productID
,countryID
,userStatus
,isUnansweredChat
,totalMsgsOp
,totalMsgsUser
,userTimezone
,waitTimeSec
,waitTimeoutSec
,chatDurationSec
,sourceSystem
,subject
,usaccept
,transferUserID
,languageID
,waitFirstClick
FROM factChat

It turned out that my users also didn’t need the chatVariables column for analysis, so I removed it from the data model as well.

Finally, after disabling the Auto Date/Time option under Options for Data Load, my data model size was around 220 MB! However, one thing still bothered me: the chatID column was still occupying almost a third of my table. And this is just a surrogate key, which is not used in any of the relationships within my data model.

[Image by author]

So, I examined two different solutions. The first was to simply remove this column and aggregate the number of chats, counting them using a GROUP BY clause:

SELECT count(chatID) chatID
,dateadd(hour, datediff(hour, 0, datetmStartUTC), 0) datetmStartUTC
,customerID
,userID
,ipAddressID
,productID
,countryID
,userStatus
,isUnansweredChat
,totalMsgsOp
,totalMsgsUser
,userTimezone
,waitTimeSec
,waitTimeoutSec
,chatDurationSec
,sourceSystem
,subject
,usaccept
,transferUserID
,languageID
,waitFirstClick
FROM factChat
GROUP BY dateadd(hour, datediff(hour, 0, datetmStartUTC), 0)
,customerID
,userID
,ipAddressID
,productID
,countryID
,userStatus
,isUnansweredChat
,totalMsgsOp
,totalMsgsUser
,userTimezone
,waitTimeSec
,waitTimeoutSec
,chatDurationSec
,sourceSystem
,subject
,usaccept
,transferUserID
,languageID
,waitFirstClick

This solution would also reduce the number of rows, since it aggregates chats grouped by the defined attributes — but the main advantage is that it drastically reduces the cardinality of the chatID column, as you can see in the next illustration:

[Image by author]

So, we went down from a cardinality of “9 million and something” to just 13!!! And this column’s memory consumption is no longer worth mentioning. Obviously, this is also reflected in our pbix file size:

[Image by author]

The second solution was simply to drop the chatID column altogether, since it’s not being used anywhere in our data model. Once I removed it from the model, we saved an additional 3 MB, but preserved the original granularity of the table!

And one last time, let’s check the pbix file size:

[Image by author]

Please recall the number we started at: 777 MB! So, I’ve managed to reduce my data model size by almost 90%, applying a few simple techniques that enabled the VertiPaq storage engine to perform more optimal compression of the data.

And this was a real use case that I faced during the past year!

General rules for reducing data model size

To conclude, here is the list of general rules you should keep in mind when trying to reduce the data model size:

  • Keep only those columns your users need in the report! Just sticking with this one single rule will save you an unbelievable amount of space, I assure you…
  • Try to optimize column cardinality whenever possible. The golden rule here is: test, test, test… and if there is a significant benefit from, for example, splitting one column into two, or substituting a decimal column with two whole-number columns, then do it! But also keep in mind that your measures will need to be rewritten to handle those structural changes in order to display the expected results. So, if your table is not big, or if you would have to rewrite hundreds of measures, maybe splitting the column is not worth it. As I said, it depends on your specific scenario, and you should carefully evaluate which solution makes more sense
  • Same as for columns, keep only the rows you need: for example, maybe you don’t need to import data from the last 10 years, but only 5! That will also reduce your data model size. Talk to your users and ask them what they really need before blindly putting everything inside your data model
  • Aggregate your data whenever possible! That means fewer rows and lower cardinality — all the nice things you are aiming to achieve! If you don’t need hour, minute, or second granularity, don’t import it! Aggregations in Power BI (and the Tabular model in general) are a very important and wide topic, which is out of the scope of this series, but I strongly recommend you check Phil Seamark’s blog and his series of posts on creative aggregation usage
  • Avoid using calculated columns whenever possible, since they are not optimally compressed. Instead, try to push all calculations to the data source (a SQL database, for example) or perform them in the Power Query Editor
  • Use proper data types (for example, if your data granularity is at the day level, there is no need to use the Date/Time data type; the Date data type will suffice)
  • Disable the Auto Date/Time option for data loading (this will remove a bunch of automatically created date tables in the background)

Conclusion

Now that you’ve learned the basics of the VertiPaq storage engine and the different techniques it uses for data compression, I wanted to wrap up this article by showing you a real-life example of how we can “help” VertiPaq (and, consequently, Power BI) get the best report performance and optimal resource consumption.

Thanks for reading, hope that you enjoyed the article!

Subscribe here to get more insightful data articles!

