
Migrate Spark job to BigQuery

source link: http://www.donghao.org/2021/05/07/migrate-spark-job-to-bigquery/


I have just finished a project migrating a Spark job to BigQuery, or more precisely: migrating Python code to SQL. It was tedious work, but it improved performance significantly: from a 4-hour PySpark runtime to half an hour on BigQuery (the honor belongs to BigQuery!).

Here are a few notes from the migration, or just some SQL tips:

1. To create or overwrite a temporary table:

CREATE OR REPLACE TEMP TABLE `my_temp_tbl` AS ...
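
For example, a full statement might look like this (a minimal sketch; the source table and columns are hypothetical):

-- Materialize an aggregate once, then reuse it in later statements of the same script.
CREATE OR REPLACE TEMP TABLE `my_temp_tbl` AS
SELECT user_id, COUNT(*) AS event_count
FROM `my_project.my_dataset.events`
GROUP BY user_id;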

2. To select all columns from a table except a few specific ones:

SELECT * EXCEPT(year, month, day) FROM ...
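
This is handy for things like copying a table without its partition columns. A minimal sketch, assuming the same hypothetical events table as above:

-- Everything except the date partition columns lands in the new temp table.
CREATE OR REPLACE TEMP TABLE `events_no_date` AS
SELECT * EXCEPT(year, month, day)
FROM `my_project.my_dataset.events`;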

3. To do a pivot() on BigQuery: https://hoffa.medium.com/easy-pivot-in-bigquery-one-step-5a1f13c6c710. The key is the EXECUTE IMMEDIATE clause, which works like eval() in Python: it takes a string as input and runs it as a SQL snippet.
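
A minimal sketch of the pattern, with hypothetical table and column names: build the pivot query as a string from the distinct pivot values, then run it.

DECLARE pivot_sql STRING;

SET pivot_sql = (
  SELECT CONCAT(
    'SELECT product, ',
    -- One SUM(IF(...)) column per distinct year found in the data.
    STRING_AGG(DISTINCT CONCAT('SUM(IF(year = ', CAST(year AS STRING), ', sales, 0)) AS sales_', CAST(year AS STRING)), ', '),
    ' FROM `my_dataset.yearly_sales` GROUP BY product'
  )
  FROM `my_dataset.yearly_sales`
);

-- Run the generated SQL string, like eval() in Python.
EXECUTE IMMEDIATE pivot_sql;

(BigQuery has since gained a native PIVOT operator, but this string-building trick still helps when the pivot columns are only known at run time.)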

4. Using the OFFSET clause with LIMIT is terribly slow when the table is very big. The best solution for me was to use “bq extract” to export the data to GCS as Parquet files, and then fetch each part of those files with a program.
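
For example (a sketch; the project, dataset, and bucket names are made up), the wildcard in the destination URI tells BigQuery to shard the export into numbered Parquet files, which a downstream program can then fetch one by one:

bq extract --destination_format=PARQUET \
  'my_project:my_dataset.big_table' \
  'gs://my-bucket/big_table/part-*.parquet'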

5. Parquet files may use column names that contain a hyphen, like “last-year” or “real-name”, but BigQuery only supports column names with underscores, like “last_year” and “real_name”. So “bq load” will automatically convert a column name like “last-year” in the Parquet file to “last_year” in the BigQuery table.
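
So a load like the following (hypothetical table and paths) ends up with a column named last_year even though the Parquet schema says last-year:

bq load --source_format=PARQUET \
  my_dataset.my_table \
  'gs://my-bucket/data/part-0.parquet'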

