Comparing The Performance Of CSV And Excel Data Sources In Power Query

My posts from two weeks ago (see here and here) on using Process Monitor to troubleshoot the performance of Power Query queries made me wonder about another question: how does the performance of reading data from CSV files compare to the performance of reading data from Excel files? I think most experienced Power Query users in either Power BI or Excel know that Excel data sources perform pretty badly but I had never done any proper tests. I’m not going to pretend that this post is a definitive answer to this question (and once again, I would be interested to hear your experiences) but hopefully it will be thought-provoking.

To start off, I took the 153.6MB CSV file used in my last few posts and built a simple query that applied a filter on one text column, then removed all but three columns. The query ran in 9 seconds and using the technique from my last post I was able to draw the following graph from a Process Monitor log file showing how Power Query reads data from the file:

Nothing very surprising there. Then I opened the same CSV file in Excel and saved the data as an xlsx file with one worksheet and no tables or named ranges; the resulting file was only 80.6MB. Finally I created a duplicate of the first query and pointed it to the Excel file instead. The resulting query ran in 59 seconds – around 6 times slower! Here’s a comparison with the performance of this query with the first query:

The black line in the graph above is the amount of data read (actually the offset values showing where in the file the data is read from, which is the same thing as a running total when Power Query is reading all the data) from the Excel file; the green line is the amount of data read from the CSV file (the same data shown in the first graph above). A few things to mention:

Running Process Monitor while this second query was refreshing had a noticeable impact on its performance – in fact it was almost 20 seconds slower
The initial values of 80 million bytes seem to be where data is read from the end of the Excel file. Maybe this is Power Query reading some file metadata? Anyway, it seems as though it takes 5 seconds before it starts to read the data needed by the query.
There’s a plateau between the 10 and 20 second mark where not much is happening; this didn’t happen consistently and may have been connected to the fact that Process Monitor was running

In any case you don’t need to study this too hard to understand that the performance of reading data from an xlsx format Excel file is terrible compared to the performance of reading data from a CSV. So, if you have a choice between these two formats, for example when downloading data, it seems fair to say that you should always choose CSV over xlsx.

Published by Chris Webb

My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel. View all posts by Chris Webb

Published August 2, 2018July 29, 2018

Comparing The Performance Of CSV And Excel Data Sources In Power Query

Comparing The Performance Of CSV And Excel Data Sources In Power Query

Share this:

Published by Chris Webb

Recommend

微软神秘SSD Z1000揭开面纱：主控厂商曾打败华为！

老婆下定极氪 001 了，车来之前要买什么呢？

如何提高设计还原质量和验收效率？

用大视觉模型开车，理想和清华叉院联手了

Refuted papers continue to be cited more than their failed replications: Can a...

Rubberband Social - Scrollable vlogs created in 10 sec | Product Hunt

关于 iPad pro12.9、MacBook air、pro 屏幕对比和选择

Anker Hot Deals活动提供高达 40% 的充电配件折扣，不到 40 小时

LG StandbyME 扬声器作为蓝牙扬声器推出，可兼作条形音箱

RedDog红狗携手顶流龙爸：2024乘龙腾飞，共创出圈营销

About Joyk