7

Reading Parquet metadata in Power Query in Power BI

 7 months ago
source link: https://blog.crossjoin.co.uk/2024/02/11/reading-parquet-metadata-in-power-query-in-power-bi/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Reading Parquet Metadata In Power Query In Power BI

Chris Webb M, Power Query February 11, 2024February 10, 2024

1 Minute

There’s a new M function in Power Query in Power BI that allows you to read the data from a Parquet file: Parquet.Metadata. It’s not documented yet and it’s currently marked as “intended for internal use only” but I’ve been told I can blog about it. Here’s an example of how to use it:

let
Source = Parquet.Metadata(File.Contents("C:\myfile.snappy.parquet"))
in
Source

…and here’s an example of the output:

image-4.png?resize=614%2C162&ssl=1

This query shows how to expand the record returned by this function into a table:

let
m = Parquet.Metadata(File.Contents("C:\myfile.snappy.parquet")),
schema = List.Accumulate(Table.ToRecords(m[Schema]), [], (x, y) => if y[NumChildren] = null then Record.AddField(x, y[Name], y[LogicalType] ?? y[ConvertedType]) else x),
expanded1 = Table.ExpandTableColumn(m[RowGroups], "Columns", {"MetaData"}),
renamed1 = Table.RenameColumns(expanded1, {{"Ordinal", "RowGroup"}, {"TotalCompressedSize", "RowGroupCompressedSize"}, {"TotalByteSize", "RowGroupSize"}}),
expanded2 = Table.ExpandRecordColumn(renamed1, "MetaData", {"Type", "Encodings", "PathInSchema", "Codec", "NumValues", "TotalUncompressedSize", "TotalCompressedSize", "KeyValueMetadata", "DataPageOffset", "IndexPageOffset", "DictionaryPageOffset", "Statistics", "EncodingStats"}),
renamed2 = Table.RenameColumns(expanded2, {{"Type", "PhysicalType"}}),
added1 = Table.AddColumn(renamed2, "Column", each Text.Combine([PathInSchema])),
added2 = Table.AddColumn(added1, "Cardinality", each [Statistics][DistinctCount]),
added3 = Table.AddColumn(added2, "NullCount", each [Statistics][NullCount]),
added4 = Table.AddColumn(added3, "DictionarySize", each [DataPageOffset] - [DictionaryPageOffset]),
added5 = Table.AddColumn(added4, "LogicalType", each Record.FieldOrDefault(schema, [Column], null)),
selected = Table.SelectColumns(added5, {"RowGroup", "Column", "Codec", "NumValues", "Cardinality", "NullCount", "TotalCompressedSize", "TotalUncompressedSize", "DictionarySize", "PhysicalType", "LogicalType"})
in
selected

As you can see this gives you all kinds of useful information about a Parquet file such as the schema, the compression type used, column cardinality and so on.

[Thanks to Curt Hagenlocher for the tip-off and the query above]

Share this:

Published by Chris Webb

My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel. View all posts by Chris Webb

Published February 11, 2024February 10, 2024

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK