
Migrate Spark job to BigQuery

source link: http://www.donghao.org/2021/05/07/migrate-spark-job-to-bigquery/


I have just finished a project migrating a Spark job to BigQuery, or more precisely: migrating Python code to SQL. It was tedious work, but it improved performance significantly: from a 4-hour PySpark runtime to half an hour on BigQuery (the honor belongs to BigQuery!).

Here are a few notes from the migration, or just some SQL tips:

1. To create or overwrite a temporary table:

CREATE OR REPLACE TEMP TABLE `my_temp_tbl` AS ...
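
For example, a full statement might look like this (a minimal sketch; the source table and columns are hypothetical):

-- Materialize an aggregate once, then reuse it in later statements of the same script.
CREATE OR REPLACE TEMP TABLE `my_temp_tbl` AS
SELECT user_id, COUNT(*) AS event_count
FROM `my_project.my_dataset.events`
GROUP BY user_id;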

2. To select all columns from a table except a few specific ones:

SELECT * EXCEPT(year, month, day) FROM ...
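
This is handy for things like copying a table without its partition columns. A minimal sketch, assuming the same hypothetical events table as above:

-- Everything except the date partition columns lands in the new temp table.
CREATE OR REPLACE TEMP TABLE `events_no_date` AS
SELECT * EXCEPT(year, month, day)
FROM `my_project.my_dataset.events`;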

3. To do a pivot() on BigQuery: https://hoffa.medium.com/easy-pivot-in-bigquery-one-step-5a1f13c6c710. The key is the EXECUTE IMMEDIATE clause, which works like eval() in Python: it takes a string as input and runs it as a SQL snippet.
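
A minimal sketch of the pattern, with hypothetical table and column names: build the pivot query as a string from the distinct pivot values, then run it.

DECLARE pivot_sql STRING;

SET pivot_sql = (
  SELECT CONCAT(
    'SELECT product, ',
    -- One SUM(IF(...)) column per distinct year found in the data.
    STRING_AGG(DISTINCT CONCAT('SUM(IF(year = ', CAST(year AS STRING), ', sales, 0)) AS sales_', CAST(year AS STRING)), ', '),
    ' FROM `my_dataset.yearly_sales` GROUP BY product'
  )
  FROM `my_dataset.yearly_sales`
);

-- Run the generated SQL string, like eval() in Python.
EXECUTE IMMEDIATE pivot_sql;

(BigQuery has since gained a native PIVOT operator, but this string-building trick still helps when the pivot columns are only known at run time.)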

4. Using the OFFSET clause with LIMIT is terribly slow when the table is very big. The best solution for me was to use “bq extract” to export the data to GCS as Parquet files, and then fetch each part of those files with a program.
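
For example (a sketch; the project, dataset, and bucket names are made up), the wildcard in the destination URI tells BigQuery to shard the export into numbered Parquet files, which a downstream program can then fetch one by one:

bq extract --destination_format=PARQUET \
  'my_project:my_dataset.big_table' \
  'gs://my-bucket/big_table/part-*.parquet'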

5. Parquet files may use column names that contain a hyphen, like “last-year” or “real-name”, but BigQuery only supports column names with underscores, like “last_year” and “real_name”. So “bq load” will automatically convert a column name like “last-year” in the Parquet file to “last_year” in the BigQuery table.
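
So a load like the following (hypothetical table and paths) ends up with a column named last_year even though the Parquet schema says last-year:

bq load --source_format=PARQUET \
  my_dataset.my_table \
  'gs://my-bucket/data/part-0.parquet'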

