
Optimize ClickHouse performance using AWS Graviton3 - Infrastructure Solutions b...

 2 years ago
source link: https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/improve-clickhouse-performance-up-to-26-by-using-aws-graviton3

Co-authors: Martin Ma and Zaiping Bie


Introduction

ClickHouse is a column-oriented database management system (DBMS) for online analytical processing (OLAP). It delivers industry-leading query performance while significantly reducing storage requirements through columnar storage and compression. It has been popular in the OLAP field for several years and is widely used by enterprises.

In this blog, we compare the query latency (processing time) and throughput of ClickHouse on two Amazon EC2 instance families over a range of instance sizes. These instance families are the Amazon EC2 C7g (based on Arm Neoverse-powered AWS Graviton3 processors) and C6i (based on 3rd Generation Intel Xeon Scalable processors). Our findings demonstrate that ClickHouse deployments on C7g instances can achieve up to 26% performance advantage over C6i instances. The following sections cover the details of our testing methodology and results.

Performance benchmark setup and result

For the benchmark setup, the ClickHouse server and client are deployed on separate instances. We connect the ClickHouse client to the ClickHouse server and repeatedly send preset queries. We then collect query processing time and throughput to compare performance between C7g and C6i instances.

Build Config

To achieve the best performance, we build ClickHouse with the latest Clang per the official procedure and also apply the CMake NATIVE and AVX-related flags as follows.

| Architecture | ClickHouse CMake flags |
|---|---|
| AArch64 | -DARCH_NATIVE=ON |
| x86_64 | -DARCH_NATIVE=ON -DENABLE_AVX2=ON -DENABLE_AVX2_FOR_SPEC_OP=ON -DENABLE_AVX512=ON -DENABLE_AVX512_FOR_SPEC_OP=ON |
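As a rough illustration of how the flags listed above could map to a build, the sketch below selects a flag set by machine architecture. The helper name `clickhouse_cmake_flags` is hypothetical, and the assignment of the AVX flags to the x86_64 build is an assumption based on AVX being an x86 instruction-set extension; it is not taken from the authors' build scripts.

```python
# Flags listed in the build configuration above.
COMMON_FLAGS = ["-DARCH_NATIVE=ON"]

# Assumption: the AVX-related flags apply only to the x86_64 (C6i) build.
X86_64_FLAGS = [
    "-DENABLE_AVX2=ON",
    "-DENABLE_AVX2_FOR_SPEC_OP=ON",
    "-DENABLE_AVX512=ON",
    "-DENABLE_AVX512_FOR_SPEC_OP=ON",
]


def clickhouse_cmake_flags(machine: str) -> list[str]:
    """Return the CMake flags for a machine name such as 'aarch64' or 'x86_64'."""
    name = machine.lower()
    if name in ("aarch64", "arm64"):
        return list(COMMON_FLAGS)
    if name in ("x86_64", "amd64"):
        return COMMON_FLAGS + X86_64_FLAGS
    raise ValueError(f"unsupported architecture: {machine}")


# Print the flag set for each architecture used in this comparison.
for arch in ("aarch64", "x86_64"):
    print(arch, " ".join(clickhouse_cmake_flags(arch)))
```

On a real build host, `platform.machine()` from the Python standard library could supply the architecture string.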

To align jemalloc behavior on C7g and C6i, the following jemalloc parameters are configured in jemalloc_internal_defs.h.in.

| jemalloc parameter | Value |
|---|---|
| LG_PAGE | 12 (One page is 2^LG_PAGE bytes) |
| LG_HUGEPAGE | 21 (One huge page is 2^LG_HUGEPAGE bytes) |
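Because both parameters are base-2 logarithms, the resulting sizes follow directly from the values above. The short calculation below is only a worked example of that arithmetic; the variable names mirror the jemalloc parameter names.

```python
# jemalloc parameters as configured above (log2 of the size in bytes).
LG_PAGE = 12
LG_HUGEPAGE = 21

page_bytes = 1 << LG_PAGE          # 2^12 bytes
hugepage_bytes = 1 << LG_HUGEPAGE  # 2^21 bytes

print(f"page size: {page_bytes} bytes ({page_bytes // 1024} KiB)")
print(f"huge page size: {hugepage_bytes} bytes ({hugepage_bytes // (1024 * 1024)} MiB)")
```

With these values, both instance families use 4 KiB pages and 2 MiB huge pages inside jemalloc.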

Server Config

The ClickHouse server runs on C7g/C6i instance families across a range of instance sizes.

The benchmark client runs on a single C7g.4xlarge instance.

The following table summarizes the tested instance types.

| Instance Type | Instance Size (vCPU) | Memory (GiB) | Storage |
|---|---|---|---|
| C7g / C6i | 2xlarge (8) | | 50GB (EBS gp3) |
| | 4xlarge (16) | | |
| | 8xlarge (32) | | |
| | 16xlarge (64) | | |

The software versions and test parameters are as follows:

| Software | Version |
|---|---|
| ClickHouse | v22.5.1.2079-stable |
| Operating System | Amazon Linux 2 |
| Kernel | 5.10.112-108.499.amzn2.aarch64 / 5.10.112-108.499.amzn2.x86_64 |

| ClickHouse server parameter | Value |
|---|---|
| max_threads | vCPU number |

Note: the max_threads parameter specifies the number of worker threads for parallel query processing on the ClickHouse server; its default value is the number of physical CPU cores. With this default setting, C7g instances outperform C6i instances by 40%, but up to half of the CPU resources on C6i instances sit idle while C7g instances are fully utilized. To fully utilize the CPU resources on C6i, we set max_threads to the vCPU count on both C7g and C6i instances in this comparison.
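One possible way to apply this setting per session is to pass it to the client as a command-line option. The sketch below is an assumption-laden illustration: the helper `max_threads_option` is hypothetical, and it relies on `os.cpu_count()` reporting logical CPUs, which on an EC2 instance should correspond to the vCPU count.

```python
import os
from typing import Optional


def max_threads_option(vcpus: Optional[int] = None) -> str:
    """Build a client option that sets max_threads to the vCPU count.

    If vcpus is not given, fall back to the logical CPU count of the
    machine running this script (assumed to equal the vCPU count).
    """
    count = vcpus if vcpus is not None else os.cpu_count()
    if not count or count < 1:
        raise ValueError("could not determine a positive vCPU count")
    return f"--max_threads={count}"


# Example for a 16xlarge instance (64 vCPUs, per the instance table).
print(max_threads_option(64))
```

The same value could instead be set in a server-side settings profile; which mechanism the authors used is not stated.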

Query Time Test

We use the web analytics dataset (the “hits” table, containing 100 million rows) and the 43 typical queries provided by the official benchmark method to collect query processing times.

For each of these 43 typical queries, the average query time is the arithmetic mean of 10 consecutive runs after one warmup run. The total query time, as shown in the following tables, is the sum of the average times of these 43 queries. We observed a performance uplift of up to 25.8% by running ClickHouse on C7g instances compared to running on C6i instances.
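The measurement procedure described above can be sketched as follows. This is a minimal illustration, not the official benchmark script: `run_query` is a hypothetical callable that executes one SQL statement against the server, and the sketch measures client-side wall-clock time, which may differ from the processing time the server reports.

```python
import time
from statistics import fmean
from typing import Callable, Iterable


def average_query_time(
    run_query: Callable[[str], None],
    sql: str,
    warmups: int = 1,
    runs: int = 10,
) -> float:
    """Mean wall-clock time of `runs` executions after `warmups` untimed ones."""
    for _ in range(warmups):
        run_query(sql)  # warmup run, not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query(sql)
        samples.append(time.perf_counter() - start)
    return fmean(samples)


def total_query_time(run_query: Callable[[str], None], queries: Iterable[str]) -> float:
    """Sum of the per-query average times, as reported in the tables."""
    return sum(average_query_time(run_query, sql) for sql in queries)
```

With the 43 benchmark queries, `total_query_time` would issue 43 × 11 = 473 queries in total, of which 430 are timed.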

The following table shows total query processing time (lower is better) comparison between C7g and C6i.

| Instance Size | C7g (Sec) | C6i (Sec) | Performance gain |
|---|---|---|---|
| 2xlarge | 34.95 | 42.77 | 18.3% |
| 4xlarge | 18.91 | 24.57 | 23.0% |
| 8xlarge | 11.72 | 15.57 | 24.8% |
| 16xlarge | | 12.16 | 25.8% |

Table 1. ClickHouse query processing time benchmark results on C7g vs C6i

Figure 1. Query time performance gains for C7g vs. C6i
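The performance gain column for query time appears to be the reduction in processing time relative to C6i, that is, (C6i − C7g) / C6i. This formula is inferred from the published numbers rather than stated by the authors. The sketch below applies it to the complete rows of Table 1; the results agree with the reported gains to within rounding of the published times.

```python
def latency_gain_pct(c7g_sec: float, c6i_sec: float) -> float:
    """Reduction in processing time on C7g relative to C6i, in percent."""
    return (c6i_sec - c7g_sec) / c6i_sec * 100


# (C7g seconds, C6i seconds) for the rows of Table 1 with both values present.
table_1 = {
    "2xlarge": (34.95, 42.77),
    "4xlarge": (18.91, 24.57),
    "8xlarge": (11.72, 15.57),
}

for size, (c7g, c6i) in table_1.items():
    print(f"{size}: {latency_gain_pct(c7g, c6i):.1f}%")
```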

We also selected the three queries that consume the most processing time (Query 19, Query 33, and Query 34) to observe the performance uplift on C7g instances compared to C6i instances.

Query 19

SELECT UserID, toMinute(EventTime) AS m, SearchPhrase, count() FROM hits_100m_obfuscated GROUP BY UserID, m, SearchPhrase ORDER BY count() DESC LIMIT 10;

Query 33

SELECT WatchID, ClientIP, count() AS c, sum(Refresh), avg(ResolutionWidth) FROM hits_100m_obfuscated GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;

Query 34

SELECT URL, count() AS c FROM hits_100m_obfuscated GROUP BY URL ORDER BY c DESC LIMIT 10;

The following tables show the results for these three queries on C7g and C6i instances (lower is better).

| Instance Size | C7g (Sec) | C6i (Sec) | Performance gain |
|---|---|---|---|
| 2xlarge | 3.995 | 4.918 | 18.8% |
| 4xlarge | 2.002 | 2.736 | 26.8% |
| 8xlarge | 1.101 | 1.558 | 29.3% |
| 16xlarge | 0.690 | 1.010 | 31.7% |

Table 2. Query 19 results on C7g vs C6i

Figure 2. Query 19 performance gains for C7g vs. C6i instances

| Instance Size | C7g (Sec) | C6i (Sec) | Performance gain |
|---|---|---|---|
| 2xlarge | 4.562 | 4.947 | 7.8% |
| 4xlarge | 2.351 | 2.816 | 16.5% |
| 8xlarge | 1.578 | 2.107 | 25.1% |
| 16xlarge | 1.137 | 1.608 | 29.3% |

Table 3. Query 33 results on C7g vs C6i

Figure 3. Query 33 performance gains for C7g vs. C6i instances

| Instance Size | C7g (Sec) | C6i (Sec) | Performance gain |
|---|---|---|---|
| 2xlarge | 3.225 | 3.766 | 14.4% |
| 4xlarge | 1.793 | 2.171 | 17.4% |
| 8xlarge | 1.066 | 1.325 | 19.6% |
| 16xlarge | 0.774 | 1.036 | 25.4% |

Table 4. Query 34 results on C7g vs C6i

Figure 4. Query 34 performance gains for C7g vs. C6i instances

Throughput Test

We used the official ClickHouse benchmark tool to collect throughput data based on the same dataset and queries. After a warmup phase, each test uses the benchmark tool to continuously send all 43 typical queries to the server and reports queries per second (QPS) at the end of the test. We observed a performance uplift of up to 31.6% in the single-connection scenario by running ClickHouse on C7g instances compared to running on C6i instances.
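A throughput run of this kind might be launched along the following lines. This is a sketch only: the helper `benchmark_command`, the host name, and the `queries.sql` file name are placeholders, and the exact options the authors passed beyond `--concurrency` are not given in the article. `clickhouse-benchmark` reads the queries to send from standard input.

```python
import shlex


def benchmark_command(host: str, concurrency: int = 1) -> list[str]:
    """Build an argument list for clickhouse-benchmark.

    host is the ClickHouse server address; concurrency is the number of
    parallel connections sending queries.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return [
        "clickhouse-benchmark",
        f"--host={host}",
        f"--concurrency={concurrency}",
    ]


# Placeholder host and query file; substitute real values before running.
cmd = benchmark_command("clickhouse-server.example", concurrency=1)
print(shlex.join(cmd) + " < queries.sql")
```

The resulting list can be passed to `subprocess.run` with the query file opened as `stdin`.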

The following table shows the QPS (higher is better) comparison for the default single connection scenario (clickhouse-benchmark --concurrency=1) on C7g and C6i.

| Instance Size | C7g (Queries/Sec) | C6i (Queries/Sec) | Performance gain |
|---|---|---|---|
| 2xlarge | 0.684 | 0.581 | 17.7% |
| 4xlarge | 2.249 | 1.738 | 29.4% |
| 8xlarge | 3.529 | 2.709 | 30.3% |
| 16xlarge | 4.536 | 3.446 | 31.6% |

Table 5. ClickHouse throughput performance results (single connection) on C7g vs C6i

Figure 5. ClickHouse throughput performance gain (single connection) for C7g vs. C6i
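For throughput, the performance gain column appears to be the QPS increase relative to C6i, that is, (C7g − C6i) / C6i. As with the query time tables, this formula is inferred from the published numbers. The sketch below applies it to the rows of Table 5.

```python
def throughput_gain_pct(c7g_qps: float, c6i_qps: float) -> float:
    """Increase in queries per second on C7g relative to C6i, in percent."""
    return (c7g_qps - c6i_qps) / c6i_qps * 100


# (C7g QPS, C6i QPS) from Table 5, single connection.
table_5 = {
    "2xlarge": (0.684, 0.581),
    "4xlarge": (2.249, 1.738),
    "8xlarge": (3.529, 2.709),
    "16xlarge": (4.536, 3.446),
}

for size, (c7g, c6i) in table_5.items():
    print(f"{size}: {throughput_gain_pct(c7g, c6i):.1f}%")
```

Note that the baseline differs in meaning from the query time tables: there the gain is a reduction in time, here it is an increase in rate, so the two percentages are not directly interchangeable.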

The following table shows the QPS (higher is better) comparison for a multi-connection scenario (clickhouse-benchmark --concurrency=N) on C7g and C6i. Note: xlarge, 2xlarge, and 4xlarge instances cannot support multiple connections because of their memory capacity limit.

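| Instance Size | Concurrency | C7g (Queries/Sec) | C6i (Queries/Sec) | Performance gain |
|---|---|---|---|---|
| 8xlarge | | 4.125 | 2.968 | 39.0% |
| | | 4.138 | 2.931 | 41.2% |
| | | 4.182 | 2.947 | 41.9% |
| | | 4.108 | 2.914 | 41.0% |
| 16xlarge | | 5.847 | 4.003 | 46.1% |
| | | 6.195 | 4.071 | 52.2% |
| | | 6.329 | 4.093 | 54.6% |
| | | 6.290 | 4.112 | 53.0% |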

Table 6. ClickHouse throughput performance results (multi connection) on C7g vs C6i

Figure 6. ClickHouse throughput performance gain (multi connection) for C7g vs. C6i

Conclusion

Deploying ClickHouse on AWS Graviton3-based C7g instances reduced query latency (processing time) by up to 26% and increased single-connection throughput by up to 32%, in addition to a 20% saving in instance price. The comparison is against equally configured C6i instances based on 3rd Generation Intel Xeon Scalable processors.

Visit the AWS Graviton3 page for customer stories on adoption of Arm-based processors. For details on how to migrate existing applications to AWS Graviton, please check this GitHub page. For any queries related to your software workloads running on Arm Neoverse platforms, feel free to reach out to us at [email protected].

