source link: https://blog.gbrueckl.at/2021/01/reading-delta-lake-tables-natively-in-powerbi/

Reading Delta Lake Tables natively in PowerBI

Working with analytical data platforms and big data on a daily basis, I was quite happy when Microsoft finally announced a connector for Parquet files back in November 2020. The Parquet file format is developed by the Apache foundation as an open-source project and has become a fundamental part of most data lake systems nowadays.

“Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.”

However, Parquet is just a file format and does not really help you when it comes to data management. Common data manipulation (DML) operations like updates and deletes still need to be handled manually by the data pipeline. This was one of the reasons why Delta Lake (delta.io) was developed; besides that, it brings a lot of other features like ACID transactions, proper metadata handling and much more. If you are interested in the details, please follow the link above.

So what is a Delta Lake table and how is it related to Parquet? Basically a Delta Lake table is a folder in your Data Lake (or wherever you store your data) and consists of two parts:

  1. Delta log files (in the sub-folder _delta_log)
  2. Data files (Parquet files in the root folder or sub-folders if partitioning is used)
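
For illustration, a minimal (hypothetical) Delta table folder could look like this (all file names are made up):

    MyDeltaTable.delta/
        _delta_log/
            00000000000000000000.json
            00000000000000000001.json
        part-00000-1a2b3c4d.c000.snappy.parquet
        part-00001-5e6f7a8b.c000.snappy.parquet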

The Delta log persists all transactions that modified the data or meta data in the table. For example, if you execute an INSERT statement, a new transaction is created in the Delta log and a new file is added to the data files which is referenced by the Delta log. If a DELETE statement is executed, a particular set of data files is (logically) removed from the Delta log, but the data files themselves still reside in the folder for a certain time. So we cannot simply read all Parquet files in the root folder but need to process the Delta log first so we know which Parquet files are valid for the latest state of the table.
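
To give an idea of what such log entries look like, here is a simplified sketch of the two corresponding actions as they could appear in a commit file (file names and values are made up and some fields are omitted; the protocol specification mentioned below has the exact schema):

    {"add": {"path": "part-00001-5e6f7a8b.c000.snappy.parquet", "partitionValues": {}, "size": 1234, "modificationTime": 1610000000000, "dataChange": true}}
    {"remove": {"path": "part-00000-1a2b3c4d.c000.snappy.parquet", "deletionTimestamp": 1610000100000, "dataChange": true}}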

These logs are usually stored as JSON files (actually JSONL files to be more precise). After 10 transactions, a so-called checkpoint-file is created which is in Parquet format and stores all transactions up to that point in time. The relevant logs for the final table are then the combination of the last checkpoint-file and the JSON files that were created afterwards. If you are interested in all the details on how the Delta Log works, here is the full Delta Log protocol specification.
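
As a (hypothetical) example, assume a table that has seen 13 commits so far, i.e. versions 0 to 12. Besides the JSON files for versions 0 to 12, its _delta_log folder would then also contain something like:

    _delta_log/
        00000000000000000010.checkpoint.parquet    (all transactions up to version 10)
        00000000000000000011.json
        00000000000000000012.json
        _last_checkpoint                            (points to the latest checkpoint)

To reconstruct the latest state (version 12), only the checkpoint file and the two JSON commits created afterwards need to be read.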

From those logs we get the information which Parquet files in the main folder must be processed to obtain the final table. The content of those Parquet files can then simply be combined and loaded into PowerBI.
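
Conceptually, this last step is not much more than reading every active Parquet file with the Parquet connector and appending the results. The following snippet is only a simplified sketch of that idea; it is not the actual implementation of fn_ReadDeltaTable and the parameter names are made up:

    // Simplified sketch only, NOT the actual fn_ReadDeltaTable implementation.
    // FolderListing: the folder listing of the Delta table (must contain [Content] and [Name])
    // ActivePaths:   the list of Parquet file names that the Delta log marks as active
    (FolderListing as table, ActivePaths as list) as table =>
    let
        ActiveFiles = Table.SelectRows(FolderListing, each List.Contains(ActivePaths, [Name])),
        ParsedFiles = Table.AddColumn(ActiveFiles, "Data", each Parquet.Document([Content])),
        Combined    = Table.Combine(ParsedFiles[Data])
    in
        Combined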

I encapsulated all this logic into a custom Power Query function which takes the folder listing of the Delta table folder as input and returns the content of the Delta table. The folder listing can either come from an Azure Data Lake Store, a local folder, or an Azure Blob Storage. The mandatory fields/columns are [Content], [Name] and [Folder Path]. There is also an optional parameter which allows you to specify further options for reading the Delta table, like the Version if you want to use time-travel. However, this is still experimental, and if you want to get the latest state of the table, you can simply omit it.

The most current M-code for the function can be found in my GitHub repository for PowerBI: fn_ReadDeltaTable.pq and will also be constantly updated there whenever I find an improvement.
The repository also contains a PowerBI Desktop file (.pbix) where you can see the single steps that make up the final function.

Once you have added the function to your PowerBI / Power Query environment you can call it like this:

= fn_ReadDeltaTable(
    AzureStorage.DataLake(
        "https://myadls.dfs.core.windows.net/public/data/MyDeltaTable.delta",
        [HierarchicalNavigation = false]),
    [Version = 12])

I would further recommend nesting your queries and separating the access to the storage (e.g. Azure Data Lake Store) from the reading of the table (the execution of the function). If you are reading from an ADLS, it is mandatory to also specify [HierarchicalNavigation = false]!
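A minimal sketch of that separation could look like the following two queries (the query names DeltaTable_Folder and DeltaTable are just examples):

    // Query "DeltaTable_Folder": only accesses the storage
    let
        Source = AzureStorage.DataLake(
            "https://myadls.dfs.core.windows.net/public/data/MyDeltaTable.delta",
            [HierarchicalNavigation = false])
    in
        Source

    // Query "DeltaTable": reads the Delta table from the folder listing above
    let
        Output = fn_ReadDeltaTable(DeltaTable_Folder)
    in
        Output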
If you are reading from a blob storage, the standard folder listing is slightly different and needs to be changed:

let
    // list all blobs in the container and keep only those that belong to the Delta table folder
    Source = AzureStorage.Blobs("https://myadls.blob.core.windows.net/public"),
    #"Filtered Rows" = Table.SelectRows(Source, each Text.StartsWith([Name], "data/MyDeltaTable.delta")),
    // rebuild [Folder Path] and [Name] so that the listing matches what fn_ReadDeltaTable expects
    #"Added FullPath" = Table.AddColumn(#"Filtered Rows", "FullPath", each [Folder Path] & "/" & [Name], Text.Type),
    #"Removed Columns" = Table.RemoveColumns(#"Added FullPath",{"Name", "Folder Path"}),
    #"Split Column by Delimiter" = Table.SplitColumn(#"Removed Columns", "FullPath", Splitter.SplitTextByEachDelimiter({"/"}, QuoteStyle.Csv, true), {"Folder Path", "Name"}),
    #"Append Delimiter" = Table.TransformColumns(#"Split Column by Delimiter",{{"Folder Path", (_) => _ & "/", type text}})
in
    #"Append Delimiter"

Right now the connector/function is still experimental and performance is not yet optimal. But I hope to get this fixed in the near future to have a native way to read and finally visualize Delta lake tables in PowerBI.

Known limitations:

  • Partitioned tables
    • currently columns used for partitioning will always have the value NULL
    • values for partitioning columns are not stored as part of the parquet file but need to be derived from the folder path
  • Performance
    • is currently not great, but this seems to be mainly related to the Parquet connector
  • Time Travel
    • currently only supports “VERSION AS OF”
    • need to add “TIMESTAMP AS OF”
  • Predicate Pushdown / Partition Elimination
    • currently not supported – it always reads the whole table

Any feedback is welcome!

Special thanks also go to Imke Feldmann (@TheBIccountant, blog) and Chris Webb (@cwebb_bi, blog) who helped me write and tune the PQ function!

Downloads: fn_ReadDeltaTable.pq (M-code)
