PostgreSQL query parallelism can be like a box of chocolates

Introduction

When PostgreSQL processes your SQL query, it creates a query plan and then executes it. If the PostgreSQL plan includes executing parts of your query in parallel, it has the potential to speed up your query.

Since the PostgreSQL query planner is a complex system, it can be challenging sometimes to predict query parallelism from simply looking at the query. Also, while some queries benefit greatly from parallelism, others may not benefit as much or not at all.

This brings us to a few important questions:

How do I know if parallelism will speed up my queries?
How do I know if I’m getting the maximum query performance benefit from parallelism?
If I’m seeing little-to-no query speed up from parallelism, can I change that?

In this blog post, we show you how to conduct an A/B comparison that should only take a few minutes to set up. We also explain how to analyze and take action on the results of your A/B comparison.

Testing a query’s parallelism potential in PostgreSQL

The test involves temporarily setting some parameters and measuring resulting query runtimes. You can do it all on your own system without downloading or installing anything. The only prerequisite is to use PostgreSQL 10 or higher (because earlier versions do not support query parallelism as well). Also, we recommend running queries during off-peak hours so that production workloads do not interfere with test results and vice versa.

For this test, we’ll be using session-only or transaction-specific parameters to our advantage. Once you close the session or abort the transaction, the parameters will be rolled back to their defaults. Thanks to this level of isolation, the global database configuration is not affected by your temporary changes. And hence, there is no risk in forgetting to undo a setting.

But, enough theory, let’s dive right into the practical part.

1. Check that the system is not under peak load (so that there’s no interference with the tests). You can query the pg_stat_activity table for verification.

2. Verify, that your PostgreSQL has at least a minimum amount of worker-processes available by reading out the max_worker_processes parameter. For instance:

SHOW max_worker_processes;

For the test, this parameters value should be 8 or more. If the result you are seeing fits, you can proceed with the next step. Otherwise, you should raise the limits value to 8 or more in your PostgreSQL configuration.

Note: you will need to restart PostgreSQL to activate changes to this setting. Furthermore, we recommend raising it in general. It may solve other potential issues, such as autovacuum running out of workers.

3. We now start a transaction and set key parameters prior to executing the query. Eventually, the transaction gets aborted to not cause any database changes. That way, you can not only run SELECT-only queries but also the ones with INSERT, UPDATE and DELETE. They will not modify the database as the transaction gets rolled back.

We start by looking at the query of your choice without any parallelism. For instance:

BEGIN;
SET max_parallel_workers_per_gather = 0;
\timing on
{Your Query}
\timing off
ABORT;

Replace “{Your Query}” with the query you want to run and jot down the query time once it finishes. For instance:

tpch_sf100=# BEGIN;
BEGIN
tpch_sf100=# SET max_parallel_workers_per_gather = 0;
SET
tpch_sf100=# \timing on
Timing is on.
tpch_sf100=# SELECT COUNT(*)
tpch_sf100-# FROM orders
tpch_sf100-# WHERE o_orderdate BETWEEN '1996-01-01' AND '1996-06-30';
  count
----------
 11347366
(1 row)

Time: 1370.341 ms (00:01.370)
tpch_sf100=# \timing off
Timing is off.
tpch_sf100=# ABORT;
ROLLBACK
tpch_sf100=#

Our example query took 1.37s to finish.

We can retrieve the query plan by running the query again but with EXPLAIN ANALYZE. This is the plan of our example query:

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=317348.62..317348.63 rows=1 width=8) (actual time=1762.263..1762.263 rows=1 loops=1)
   ->  Index Only Scan using idx_orders_orderdate on orders  (cost=0.57..288997.95 rows=11340269 width=0) (actual time=0.092..1287.946 rows=11347366 loops=1)
         Index Cond: ((o_orderdate >= '1996-01-01'::date) AND (o_orderdate <= '1996-06-30'::date))
         Heap Fetches: 0
 Planning Time: 0.367 ms
 JIT:
   Functions: 4
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 1.132 ms, Inlining 0.000 ms, Optimization 0.374 ms, Emission 5.498 ms, Total 7.003 ms
 Execution Time: 1763.483 ms
(10 rows)

As you can see, there is no parallelism visible in this plan.

Now, we repeat the experiment with the same query but this time we switch on parallelism. You can make use of the following piece of code, which helps force the PostgreSQL query planner to utilize up to 6 parallel workers (if the query is parallelizable at all):

BEGIN;
SET max_parallel_workers = 6;
SET max_parallel_workers_per_gather = 6;
SET parallel_leader_participation = off;
SET parallel_tuple_cost = 0;
SET parallel_setup_cost = 0;
SET min_parallel_table_scan_size = 0;
\timing on
{Your Query}
\timing off
ABORT;

Note: The settings above are suitable for our A/B test. I do not advise using them as-is in production without testing and tuning first. Here’s where to read the excellent PostgreSQL documentation on parallelism settings .

Follow the same procedure as before: replace “{Your Query}”, execute it, and note down the timing.

Some more console output for you:

tpch_sf100=# BEGIN;
BEGIN
tpch_sf100=#
tpch_sf100=# SET max_parallel_workers = 6;
SET
tpch_sf100=# SET max_parallel_workers_per_gather = 6;
SET
tpch_sf100=# SET parallel_leader_participation = off;
SET
tpch_sf100=# SET parallel_tuple_cost = 0;
SET
tpch_sf100=# SET parallel_setup_cost = 0;
SET
tpch_sf100=# SET min_parallel_table_scan_size = 0;
SET
tpch_sf100=# \timing on
Timing is on.
tpch_sf100=# SELECT COUNT(*)
tpch_sf100-# FROM orders
tpch_sf100-# WHERE o_orderdate BETWEEN '1996-01-01' AND '1996-06-30';
  count
----------
 11347366
(1 row)

Time: 373.622 ms
tpch_sf100=# \timing off
Timing is off.
tpch_sf100=# ABORT;
ROLLBACK
tpch_sf100=#

With the parallelism set to 6 parallel workers in total, our query took 0.37s. This is a speedup of 3.7x. Let’s have a look into the EXPLAIN ANALYZE plan:

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=199220.85..199220.86 rows=1 width=8) (actual time=409.257..409.257 rows=1 loops=1)
   ->  Gather  (cost=199220.82..199220.83 rows=6 width=8) (actual time=399.038..402.165 rows=6 loops=1)
         Workers Planned: 6
         Workers Launched: 6
         ->  Partial Aggregate  (cost=199220.82..199220.83 rows=1 width=8) (actual time=358.284..358.284 rows=1 loops=6)
               ->  Parallel Index Only Scan using idx_orders_orderdate on orders  (cost=0.57..194495.71 rows=1890045 width=0) (actual time=0.063..278.356 rows=1891228 loops=6)
                     Index Cond: ((o_orderdate >= '1996-01-01'::date) AND (o_orderdate <= '1996-06-30'::date))
                     Heap Fetches: 0
 Planning Time: 0.395 ms
 JIT:
   Functions: 31
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 5.325 ms, Inlining 0.000 ms, Optimization 2.850 ms, Emission 40.859 ms, Total 49.034 ms
 Execution Time: 414.380 ms
(14 rows)

As you can see, it contains a “Workers Launched” statement with the value being 6. Meaning, the parts of the plan below this statement were executed with 6 parallel workers.

Repeat this two-step experiment with other queries you might be having or are interested in and write down their timings next to their respective category (parallelism and no-parallelism).

4. Compare your query times by looking at no-parallelism vs. parallelism. Note which ones ran faster with parallelism and which queries did not.

Analysis: What does this simple test tell you?

Based on the results you gathered, you now know which queries benefit from parallelism and how much faster they become. Your queries will fall into one of three categories:

Queries that did not get faster
Queries that performed slightly faster, (e.g., much less than linearly, like less than 6x with 6 workers, 4x with 4 workers, etc.)
Queries that performed much faster (linearly faster or better)

What can you do to improve the parallelism benefit?

Queries that did not get faster

Try rewriting your queries. PostgreSQL can parallelize some query actions better than others (and other parts not at all). Here are a few tips:

Inline your CTEs (or use the PostgreSQL 12 feature to automatically do it). Alternatively, make use of UNLOGGED tables.
Use CREATE TABLE AS for new tables as it executes the query in parallel.
Mark functions as PARALLEL SAFE when applicable.
For INSERT/UPDATE/DELETE statements with complex SQL statements in the body, consider moving these out to UNLOGGED tables.

Queries that got slightly faster

Assuming none of the tips above apply to your queries, then the queries in this category might be bound by hardware. Check your memory, number of cores, and your storage, especially with respect to its latency.

Queries that got much faster

In this category, it is likely safe to say that your queries are written for parallelism and they are running on adequate hardware. You might benefit from increasing the number of parallel workers even more. However, be aware, that PostgreSQL does limit it at automatically some point.

Will adding more parallel workers help?

The short answer is: it depends.

Our advice is: give it a shot by increasing the value for max_parallel_workers_per_gather . However, as mentioned above, PostgreSQL limits the parallelism at some point. For instance, you may want to try to set the value to the amount of CPU cores you have available and max_parallel_workers in accordance (for instance 2-3x larger). However, PostgreSQL might not utilize all of these workers since it will logarithmically limit the maximum amount. Typically, 12 or 14 is the upper limit.

Also, some scan types do not take too much parallelism well. One of these is the BITMAP HEAP SCAN which tends to be less performant when applying too much parallelism.

Last but not least, some query structures are executed in parallel, but their parallelism is distributed unevenly. Therefore, you might see the CPU utilization varying a lot throughout query execution. One possibility to improve this situation is to rewrite the query such, that PostgreSQL can create a more “distributed” query plan.

Can you improve the parallelism further with the Swarm64 DA extension?

Swarm64 DA is a PostgreSQL extension that improves parallelism (among other things to help speed up your queries). If PostgreSQL parallelism does not increase your query performance, then Swarm64 DA’s added parallelism is not likely to help speed those queries up either.

Queries which DO benefit from standard PostgreSQL parallelism are likely to be further accelerated (often 10x or more) with the improved parallel query planning and execution that Swarm64 DA provides. One observation you are likely to make is that there is not a linear relationship between PostgreSQL parallelism and the corresponding performance improvement. For example, we added 6x more resources in our example above, but the speed only increased by 3.7x. Hence, “simply” adding parallel resources may add much less speedup than you would expect. We illustrate this behavior in the following graph:

As you can see, the speed up is very far from a linear parallelization.

Why is this happening? This topic has been widely discussed at several PostgreSQL conferences (for instance during the PostgreSQL Conference Europe 2018 ). A short summarizing explanation is, that i) PostgreSQL does limit the maximum applicable parallelism and ii) additional parallelism possibly cannot be applied evenly throughout all parts of the query.

For example, consider a query with a nested SELECT statement. Will PostgreSQL parallelize it? For sure it will. However, it might spawn several different parallel gather nodes for this query. Thus, depending on the configuration of parameters, parts of the query might be parallelized more and others less.

We mentioned one solution to this is already: re-write the query such, that the query planner is able to create a more “linear” plan out of it. This ideally enables PostgreSQL to only spawn a single gather group and apply parallelism throughout the whole query. While this might cause more nodes in total to be spawned, it will make the overall processing leaner and more efficient, since the query can be executed in parallel throughout.

Swarm64 does add this functionality automatically by introducing additional nodes to the PostgreSQL query plan. They help to achieve exactly this: complex queries are transformed such, that they stay parallel throughout finalization and can be processed more efficiently. Hence, the graph from above eventually looks like this next graph.

Bottom line: Swarm64 adds the features required to approximate the ideal parallelism effect.

Conclusion

We showed you a very practical test with which you can observe and analyze the effects of parallelism on a set of queries. The more queries you experiment with, the more experience you will gain and eventually be able to understand and predict when PostgreSQL can apply parallelism to your queries and when not.

If your queries show any speed up when PostgreSQL parallelism is enabled, chances are high to get much further with Swarm64. You can possibly get 10x acceleration or more on these queries.

I’d love to hear your thoughts on the topic, and feel free to contact us if you want to discuss your use case or if you’d like to give Swarm64 DA a try.

Stay safe!

Introduction

Testing a query’s parallelism potential in PostgreSQL

Analysis: What does this simple test tell you?

What can you do to improve the parallelism benefit?

Queries that did not get faster

Queries that got slightly faster

Queries that got much faster

Will adding more parallel workers help?

Can you improve the parallelism further with the Swarm64 DA extension?

Conclusion

Recommend

IDEA 插件推荐：让你写出好代码的神器

厉害了！知道这样重命名文件都是大神级别！

Python and Go : Part I - gRPC

Not everything is UTF-8

使用PyTorch建立您的第一个文本分类模型

【Code】关于 GCN，我有三种写法

增强分析技术原理与实践

消费品品类速览：烘焙食品市场什么样？

张朝阳直播带货首秀带的不仅是好物而是价值平台

36氪首发 |「VIPidea」完成千万级天使轮融资，定位在线AI外教美术教育平台

About Joyk