Observability Tips and Tricks For Using Grafana and Prometheus

dejanualex

seasoned devops && open source enthusiast && @GitKraken ambassador

“Dashboard anything. Observe everything.”

Overview

I recently started working on a project heavily focused on observability and monitoring, in which the Prometheus configuration and all the Grafana dashboards and alarms had been implemented by someone else.

In other words, I was working blindfolded with a black box, knowing little about either the application or how it was being monitored.

Foundation

Some important concepts before going further:

Prometheus stores data in a time-series data model, basically streams of values associated with a timestamp.
Prometheus scrapes endpoints, also known as instances; a collection of instances with the same purpose forms a job.

Every time-series is uniquely identified by its metric name and optional key-value pairs called labels.

#<metric name>{<label name>=<label value>, ...}
kube_node_labels{cluster="aws-01", label_kubernetes_io_role="master"}
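
To make the "metric name plus labels" identity concrete, here is a minimal Python sketch. The helper names `format_selector` and `series_key` are hypothetical and only illustrate the idea; they are not part of any Prometheus client library, and escaping of special characters in label values is left out for brevity.

```python
from typing import FrozenSet, Mapping, Tuple


def format_selector(metric_name: str, labels: Mapping[str, str]) -> str:
    """Render a metric name and its labels in the notation shown above."""
    if not labels:
        return metric_name
    # Sort the labels so the rendered text is stable regardless of dict order.
    pairs = ", ".join(f'{name}="{value}"' for name, value in sorted(labels.items()))
    return f"{metric_name}{{{pairs}}}"


def series_key(metric_name: str, labels: Mapping[str, str]) -> Tuple[str, FrozenSet]:
    """Build a hashable identity: same name + same label set = same series."""
    return (metric_name, frozenset(labels.items()))
```

Two calls to `series_key` with the same name and labels compare equal whatever the label order, while changing a single label value (for example `cluster="aws-02"`) yields a different series.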

Grafana supports many different storage backends for your time-series data (data source). We will focus on Prometheus.
Grafana's officially supported data sources

Start exploring

Get all labels

As I said, we know next to nothing about our metrics, so some probing of metrics and labels is required. For this we can create a new throwaway dashboard and tinker with a variable of type Query.

So let’s say we want to get all the labels…kind of greedy.

label_names()

We have the following options:

label_names()	Returns a list of label names.
label_values(label)	Returns a list of label values for the label in every metric.
label_values(metric, label)	Returns a list of label values for the label in the specified metric.
metrics(metric)	Returns a list of metrics matching the specified metric regex.
query_result(query)	Returns a list of Prometheus query result for the query.

Keep in mind that the label functions don't support queries and are not Prometheus (PromQL) functions; they are part of Grafana templating, which is why we created a new dashboard.

We can also get all the labels from the Prometheus API endpoint /api/v1/labels:

Prometheus API
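
The same call can be scripted. Below is a minimal Python sketch that uses only the standard library; the base URL is an assumption taken from the curl examples at the end of this article, and the function names are hypothetical. It relies on the API's JSON envelope, which carries a `status` field and the label names under `data`.

```python
import json
from typing import List
from urllib.request import urlopen

# Assumption: Prometheus listens on the address used in this article's curl examples.
PROMETHEUS_URL = "http://127.0.0.1:9091"


def parse_label_list(payload: str) -> List[str]:
    """Extract the list under "data" from a /api/v1/labels style response."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus API error: {body.get('error', 'unknown')}")
    return sorted(body["data"])


def fetch_label_names(base_url: str = PROMETHEUS_URL, timeout: float = 5.0) -> List[str]:
    """Call /api/v1/labels and return all label names, sorted."""
    with urlopen(f"{base_url.rstrip('/')}/api/v1/labels", timeout=timeout) as resp:
        return parse_label_list(resp.read().decode("utf-8"))
```

Calling `fetch_label_names()` against a reachable Prometheus should return roughly what the `label_names()` variable shows in Grafana.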

Extract all the values for a particular label

Let’s assume we want all the values of the dockerVersion label. We can get them with Grafana's label_values function or with the Prometheus API endpoint /api/v1/label/<label_name>/values.

Label Values
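
As a rough Python equivalent of that API call, the sketch below builds the endpoint URL for a given label and reads the values from the response. The base URL is an assumption (the address used in this article's curl examples) and the function names are hypothetical.

```python
import json
from typing import List
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: same Prometheus address as in this article's curl examples.
PROMETHEUS_URL = "http://127.0.0.1:9091"


def label_values_url(base_url: str, label_name: str) -> str:
    """Build the /api/v1/label/<label_name>/values URL for one label."""
    # quote() percent-encodes any character that is not safe in a URL path segment.
    return f"{base_url.rstrip('/')}/api/v1/label/{quote(label_name, safe='')}/values"


def fetch_label_values(label_name: str, base_url: str = PROMETHEUS_URL,
                       timeout: float = 5.0) -> List[str]:
    """Return every value Prometheus has recorded for the given label."""
    with urlopen(label_values_url(base_url, label_name), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus API error: {body.get('error', 'unknown')}")
    return body["data"]
```

For example, `fetch_label_values("dockerVersion")` would list the Docker versions seen, and `fetch_label_values("job")` would list the scraped jobs.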

Get all the jobs

The Prometheus API exposes the endpoint /api/v1/label/job/values, which returns all the jobs scraped by that particular Prometheus instance.

Prometheus jobs endpoint

Quick check if the instances are healthy/reachable - automatically generated labels and time-series

When Prometheus scrapes a target, it automatically attaches some labels (job and instance) to the scraped time series to identify the target. It also generates a few time series for every scrape, including up, which is 1 when the scrape succeeded and 0 when it failed.
```
# up time series is useful for instance availability monitoring
up{job="<job-name>", instance="<instance-id>"}
```
Going further, we can also check all the labels for a particular job using the up metric.

Job fluent-bit labels and value
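
The up check can also be scripted. The sketch below is one possible approach in Python: it goes through the Prometheus instant-query endpoint /api/v1/query, which is not used elsewhere in this article, and all function names are hypothetical. It parses the JSON response to list targets whose up value is 0 and to collect the labels attached to a job's targets.

```python
import json
from typing import Dict, List, Set, Tuple
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL, percent-encoding the PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def _samples(payload: str) -> List[dict]:
    """Return the result vector from an instant-query response."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus API error: {body.get('error', 'unknown')}")
    return body["data"]["result"]


def down_instances(payload: str) -> List[Tuple[str, str]]:
    """List (job, instance) pairs whose last scrape failed (up == 0)."""
    down = []
    for sample in _samples(payload):
        _timestamp, value = sample["value"]  # the value arrives as a string
        if float(value) == 0.0:
            metric = sample["metric"]
            down.append((metric.get("job", ""), metric.get("instance", "")))
    return sorted(down)


def labels_seen(payload: str) -> Dict[str, Set[str]]:
    """Collect every label name and the values it takes across the result."""
    seen: Dict[str, Set[str]] = {}
    for sample in _samples(payload):
        for name, value in sample["metric"].items():
            seen.setdefault(name, set()).add(value)
    return seen
```

To use it, fetch `instant_query_url("http://127.0.0.1:9091", 'up{job="fluent-bit"}')` with curl or `urllib.request.urlopen` and pass the response body to either function.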

Conclusions

Grafana and Prometheus are widely used in organizations, and they make a useful monitoring stack: even when we don't know all the implementation details of how metrics are collected, the tips above let us shed some light on them.

Grafana	Prometheus API
label_names()	curl -s http://127.0.0.1:9091/api/v1/labels
label_values(<label_name>), label_values(<metric_name>,<label_name>)	curl -s http://127.0.0.1:9091/api/v1/label/<label_name>/values
up{job="<job_name>"}	curl -s http://127.0.0.1:9091/api/v1/label/job/values
