Scaling Percona Monitoring and Management (PMM)

Starting with PMM 1.13, PMM uses Prometheus 2 for metrics storage, which tends to be the heaviest consumer of CPU and RAM. With the Prometheus 2 performance improvements, PMM can scale to more than 1000 monitored nodes per instance in the default configuration. In this blog post we will look into PMM scaling and capacity planning: how to estimate the resources required, and what drives resource consumption.


We have now tested PMM with up to 1000 nodes, using a virtualized system with 128GB of memory, 24 virtual cores, and SSD storage. We found PMM scales pretty linearly with the available memory and CPU cores, and we believe that a higher number of nodes could be supported with more powerful hardware.

What drives resource usage in PMM?

Depending on your system configuration and workload, a single node can generate very different loads on the PMM server. The main factors that impact the performance of PMM are:

Number of samples (data points) injected into PMM per second
Number of distinct time series they belong to (cardinality)
Number of distinct query patterns your application uses
Number of queries you run against PMM, through the user interface or the API, and their complexity

These, in turn, are affected by:

Software version – modern database software versions expose more metrics
Software configuration – some metrics are only exposed in certain configurations
Workload – a large number of database objects and high concurrency will increase both the number of samples ingested and their cardinality
Exporter configuration – disabling collectors can reduce the amount of data collected
Scrape frequency – controlled by METRICS_RESOLUTION

All these factors together may change resource requirements by a factor of ten or more, so do your own testing to be sure. However, the numbers in this article should serve as good general guidance and a starting point for your research.
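
To see how cardinality and scrape frequency combine, here is a minimal Python sketch. It relies on one simplifying assumption that is not from our tests: every active time series is scraped at the same interval, so each series yields one sample per scrape. A real deployment may scrape different groups of metrics at different intervals, and the 8,000-series node below is a hypothetical example.

```python
def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Estimate ingest rate: one sample per active series per scrape.

    Assumes a single uniform scrape interval for all series (a simplification).
    """
    if scrape_interval_s <= 0:
        raise ValueError("scrape interval must be positive")
    return active_series / scrape_interval_s


if __name__ == "__main__":
    # Hypothetical node exposing 8,000 active series.
    for interval in (1, 5, 60):
        rate = samples_per_second(8_000, interval)
        print(f"interval={interval}s -> {rate:.0f} samples/sec")
```

Halving the scrape interval doubles the ingest rate for the same set of series, which is why scrape frequency appears in the list above alongside workload and exporter configuration.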

On the system supporting 1000 instances we observed the following performance:

[Figure: Prometheus metrics from the 1000-instance test system]

As you can see, we have more than 2,000 scrapes/sec performed, providing almost two million samples/sec, and more than eight million active time series. These are the main numbers that define the load placed on Prometheus.
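
Dividing these totals by the number of monitored instances gives a rough per-node load. The sketch below uses rounded versions of the figures quoted above (the observed values were "more than" or "almost" these numbers), so treat the results as approximations rather than measurements.

```python
# Rounded figures from the 1000-instance test; the real values were approximate.
NODES = 1_000
SCRAPES_PER_SEC = 2_000
SAMPLES_PER_SEC = 2_000_000
ACTIVE_SERIES = 8_000_000


def per_node(total: float, nodes: int = NODES) -> float:
    """Average share of a cluster-wide total carried by one monitored node."""
    return total / nodes


per_node_load = {
    "scrapes_per_sec": per_node(SCRAPES_PER_SEC),
    "samples_per_sec": per_node(SAMPLES_PER_SEC),
    "active_series": per_node(ACTIVE_SERIES),
}

# Average time between samples of the same series, across all series.
avg_sample_interval_s = ACTIVE_SERIES / SAMPLES_PER_SEC

if __name__ == "__main__":
    for name, value in per_node_load.items():
        print(f"{name}: {value:.0f}")
    print(f"average sample interval: {avg_sample_interval_s:.1f}s")
```

With these rounded inputs, an average node contributes about 2,000 samples/sec and 8,000 active series. Your own nodes may differ by a large factor, for the reasons listed earlier.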

Capacity planning to scale PMM

Both CPU and memory are very important resources for PMM capacity planning. Memory is the more important of the two, as Prometheus 2 does not have good options for limiting memory consumption. If you do not have enough memory to handle your workload, Prometheus will run out of memory and crash.

We recommend at least 2GB of memory for a production PMM installation. A test installation with 1GB of memory is possible. However, it may not be able to monitor more than one or two nodes without running out of memory. With 2GB of memory you should be able to monitor at least five nodes without problems.

With powerful systems (8GB or more) you can have approximately eight systems per 1GB of memory, or about 15,000 samples ingested/sec per 1GB of memory.

To estimate the CPU resources required, allow for about 50 monitored systems per core (or 100K metrics/sec per CPU core).
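
These rules of thumb can be turned into a quick sizing estimate. The following Python sketch is only a starting point: the function name and the 8GB floor are our own assumptions for illustration (the eight-systems-per-GB figure applies to servers with 8GB or more, so the sketch never suggests less than that), and your workload may need considerably more or less.

```python
import math

SYSTEMS_PER_GB = 8      # rule of thumb for servers with 8GB of memory or more
SYSTEMS_PER_CORE = 50   # rule of thumb for CPU
MIN_LARGE_SERVER_GB = 8 # assumption: never size below where the rule applies


def estimate_pmm_server(nodes: int) -> dict:
    """Rough PMM server size for a given number of monitored systems."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    memory_gb = max(MIN_LARGE_SERVER_GB, math.ceil(nodes / SYSTEMS_PER_GB))
    cpu_cores = max(1, math.ceil(nodes / SYSTEMS_PER_CORE))
    return {"memory_gb": memory_gb, "cpu_cores": cpu_cores}


if __name__ == "__main__":
    for n in (50, 200, 1000):
        print(n, estimate_pmm_server(n))
```

For 1,000 nodes this gives 125GB of memory and 20 cores, which is in line with the 128GB, 24-core system used in our test.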

One problem you’re likely to encounter if you’re running PMM with 100+ instances is the “Home Dashboard”. This becomes way too heavy with such a large number of servers. We plan to fix this issue in future releases of PMM, but for now you can work around it in two simple ways:

You can select a single host, for example “pmm-server”, in your home dashboard and save it before adding a large number of hosts to the system.


Or you can make some other dashboard of your choice and set it as the home dashboard.

Summary

More than 1,000 monitored systems are possible per PMM server
Your specific workload and configuration may significantly change the resources required
If deploying with 8GB or more, plan 50 systems per core, and eight systems per 1GB of RAM
