Building a Prometheus-Based Monitoring and Alerting System in Python

source link: https://blog.51cto.com/lee90/5951213


By 我的二狗呢, 2022-12-18 22:36:53. Blog category: Linux.

Tags: alerting system. Article category: Linux, Systems/Operations.

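It was too cold to go out this weekend, so I stayed home and put together a demo of GUI-based ("white-screen") operations for Prometheus. So far only a few simple backend endpoints exist; input validation and the like have not been added yet.

I'm recording the setup here for now. Once the backend is finished, I'll try writing the frontend as well.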

[Images 1-4 from the original post are not reproduced here.]

Key points:

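1. Prometheus targets are stored in a database; the API only has to return them in the expected format. Prometheus has supported dynamic target discovery over an HTTP endpoint (http_sd_configs) for quite a while. The format looks roughly like this: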

[Image 5 from the original post: example of the target format.]
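
Since the format is only shown as an image, here is a minimal sketch of how such a response could be built. Prometheus's HTTP service discovery expects a JSON list of objects, each with a `targets` list and an optional `labels` map of string values, returned with HTTP 200 and `Content-Type: application/json`. The function name and the row keys (`address`, `datacenter`, and so on) are assumptions for illustration, not the author's actual schema; the `__meta_*` label names are taken from the relabel rules in the configuration file.

```python
import json


def build_http_sd_response(rows):
    """Turn target rows (e.g. fetched from the database) into HTTP SD JSON.

    The row keys used here are hypothetical column names.
    """
    groups = []
    for row in rows:
        groups.append({
            # "targets" holds host:port strings, without scheme or path.
            "targets": [row["address"]],
            # Label values must be strings; the __meta_* names match the
            # source_labels used by the relabel rules in prometheus.yml.
            "labels": {
                "__meta_datacenter": str(row["datacenter"]),
                "__meta_prometheus_job": str(row["job"]),
                "__meta_role": str(row["role"]),
                "__meta_cluster": str(row["cluster"]),
                "__meta_instance": str(row["instance"]),
            },
        })
    return json.dumps(groups, ensure_ascii=False)


if __name__ == "__main__":
    demo_rows = [{
        "address": "192.168.31.10:9100",
        "datacenter": "dc1",
        "job": "node",
        "role": "db",
        "cluster": "c1",
        "instance": "db01",
    }]
    print(build_http_sd_response(demo_rows))
```

Whatever web framework serves `/api/prom/prom_targets` would return this string as the response body.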

The Prometheus configuration file needs a few changes, mainly some relabel rules, as follows:

$ cat /usr/local/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.31.181:9093
rule_files:
  - "rules/*.yml"
  # - "rules/*.yaml"

scrape_configs:
  - job_name: "alertcenter_api"
    metrics_path: "/metrics"
    http_sd_configs:
      - url: "http://192.168.31.79:8000/api/prom/prom_targets"
        refresh_interval: 30s
    relabel_configs:
      - source_labels:
          - "__meta_datacenter"
        separator: "-"
        regex: "(.*)"
        target_label: "datacenter"
        action: replace
        replacement: "$1"
      - source_labels:
          - "__meta_prometheus_job"
        separator: "-"
        regex: "(.*)"
        target_label: "job"
      - source_labels:
          - "__meta_role"
        separator: "-"
        regex: "(.*)"
        target_label: "role"
      - source_labels:
          - "__meta_cluster"
        separator: "-"
        regex: "(.*)"
        target_label: "cluster"
      - source_labels:
          - "__meta_instance"
        separator: "-"
        regex: "(.*)"
        target_label: "instance"
      - source_labels:
          - "__address__"
        separator: "-"
        regex: "(.*)"
        target_label: "endpoint"
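
To make the effect of these relabel rules concrete, the following is a simplified Python model of the `replace` action (the default when `action` is omitted). It is only a sketch: Python's `re` module stands in for the RE2 engine that Prometheus uses, only `$1`-style references are expanded, and the final step where Prometheus drops labels starting with `__` is not modelled. The function name and the sample label values are made up for illustration.

```python
import re


def apply_replace_rule(labels, rule):
    """Apply one relabel rule with action=replace to a dict of labels."""
    separator = rule.get("separator", ";")
    # Source label values are joined with the separator; missing labels
    # count as empty strings.
    source = separator.join(labels.get(name, "") for name in rule["source_labels"])
    # The regex has to match the whole joined value.
    match = re.fullmatch(rule.get("regex", "(.*)"), source)
    if match is None:
        return dict(labels)  # no match: labels stay unchanged

    def expand(ref):
        return match.group(int(ref.group(1))) or ""

    value = re.sub(r"\$(\d+)", expand, rule.get("replacement", "$1"))
    result = dict(labels)
    if value:
        result[rule["target_label"]] = value
    else:
        # An empty replacement result removes the target label.
        result.pop(rule["target_label"], None)
    return result


if __name__ == "__main__":
    labels = {
        "__address__": "192.168.31.10:9100",
        "__meta_datacenter": "dc1",
        "__meta_instance": "db01",
    }
    rules = [
        {"source_labels": ["__meta_datacenter"], "separator": "-",
         "regex": "(.*)", "target_label": "datacenter"},
        {"source_labels": ["__meta_instance"], "separator": "-",
         "regex": "(.*)", "target_label": "instance"},
        {"source_labels": ["__address__"], "separator": "-",
         "regex": "(.*)", "target_label": "endpoint"},
    ]
    for rule in rules:
        labels = apply_replace_rule(labels, rule)
    print(labels)
```

Because `instance` is overwritten with the value from the database, copying `__address__` into `endpoint` keeps the real scrape address visible on every series.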

2. The alerting rules are also stored in the database. The rows are rendered to JSON, converted into a YAML file, and then applied to Prometheus so that they take effect.

[Image 6 from the original post.]
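
The rendering step could look roughly like the sketch below. It is not the author's code: the row keys (`alert`, `expr`, `for`, `labels`, `annotations`) are assumed column names, and instead of a YAML library it writes the rule-file layout by hand and quotes every scalar with `json.dumps`, relying on the fact that a JSON string is also a valid YAML double-quoted scalar. A YAML library such as PyYAML would be the more usual choice.

```python
import json


def quote(value):
    """Quote a scalar so that it is safe to embed in YAML."""
    return json.dumps(str(value), ensure_ascii=False)


def render_rule_file(group_name, rows):
    """Render alert rule rows into the text of a Prometheus rule file."""
    lines = ["groups:", f"  - name: {quote(group_name)}", "    rules:"]
    for row in rows:
        lines.append(f"      - alert: {quote(row['alert'])}")
        lines.append(f"        expr: {quote(row['expr'])}")
        lines.append(f"        for: {quote(row['for'])}")
        for section in ("labels", "annotations"):
            items = row.get(section) or {}
            if items:
                lines.append(f"        {section}:")
                for key, value in items.items():
                    lines.append(f"          {key}: {quote(value)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo_rows = [{
        "alert": "HighLoad",
        "expr": "node_load1 > 8",
        "for": "1m",
        "labels": {"severity": "critical"},
        "annotations": {"summary": "load is high on {{ $labels.instance }}"},
    }]
    print(render_rule_file("demo", demo_rows), end="")
```

The resulting text would be written to a file under `rules/` (matched by `rule_files` in prometheus.yml). It can be validated with `promtool check rules <file>` before Prometheus is told to reload its configuration, for example with SIGHUP, or with a POST to `/-/reload` when Prometheus runs with `--web.enable-lifecycle`.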

3. Alertmanager alerting. Configure a webhook receiver, roughly like this:

$ cat /usr/local/alertmanager-0.23.0.linux-amd64/alertmanager.yml
global:
  resolve_timeout: 30s

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 30s
  receiver: 'webhook1'

  routes:
    # Note: `match` compares label values literally, so this regex-looking
    # value never equals a real job name. The `match_re` route below is the
    # one that actually routes database alerts to the dba receiver.
    - match:
        job: '^.*(数据库|mysql|MySQL).*$'
      receiver: dba
      group_wait: 10s
      group_interval: 30s
      repeat_interval: 30s
    - match_re:
        job: '^.*(数据库|mysql|MySQL).*$'
      group_wait: 30s
      group_interval: 30s
      repeat_interval: 30s
      receiver: dba

receivers:
  - name: webhook1
    webhook_configs:
      - send_resolved: true
        url: http://192.168.31.79:8000/api/prom/test
  - name: dba
    webhook_configs:
      - send_resolved: true
        url: http://192.168.31.79:8000/api/prom/test
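
The two child routes use the same pattern but behave differently, because `match` tests for equality while `match_re` tests against an anchored regular expression. The sketch below models that difference in Python; the function names are made up, and Python's `re` is only an approximation of Alertmanager's regex engine.

```python
import re

JOB_PATTERN = "^.*(数据库|mysql|MySQL).*$"


def match_exact(labels, matchers):
    """Model of `match`: each label must equal the given value exactly."""
    return all(labels.get(name, "") == value for name, value in matchers.items())


def match_regex(labels, matchers):
    """Model of `match_re`: each label must fully match the given regex."""
    return all(
        re.fullmatch(pattern, labels.get(name, "")) is not None
        for name, pattern in matchers.items()
    )


if __name__ == "__main__":
    alert_labels = {"alertname": "HighLoad", "job": "mysql-prod"}
    print(match_exact(alert_labels, {"job": JOB_PATTERN}))
    print(match_regex(alert_labels, {"job": JOB_PATTERN}))
```

With a job named `mysql-prod`, only the `match_re` variant selects the alert, so the first child route in the configuration has no effect in practice.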

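The POST endpoint does quite a lot. Roughly: (1) it receives the message pushed by Alertmanager (so far these come in two kinds: firing alerts and resolved recoveries); (2) it uses Selenium to open the Prometheus web UI and take a screenshot; (3) it uploads the screenshot to Tencent Cloud object storage and generates a fixed, publicly accessible link; (4) it sends a DingTalk alert message containing the text and the screenshot. The result looks like this: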

[Images 7-8 from the original post: example DingTalk alert messages.]
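
As a rough illustration of steps (1) and (4), the sketch below turns an Alertmanager webhook payload into the JSON body of a DingTalk robot markdown message. The function name and the message layout are assumptions. The Selenium screenshot and the upload are not shown: the code only accepts an already generated `screenshot_url`. Sending the result would be a JSON POST to the DingTalk robot webhook URL.

```python
def build_dingtalk_message(payload, screenshot_url=None):
    """Build a DingTalk markdown message from an Alertmanager webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        # Alertmanager marks each alert as either "firing" or "resolved".
        state = "FIRING" if alert.get("status") == "firing" else "RESOLVED"
        lines.append(f"### [{state}] {labels.get('alertname', 'unknown')}")
        lines.append(f"- instance: {labels.get('instance', '-')}")
        lines.append(f"- summary: {annotations.get('summary', '-')}")
        lines.append(f"- started: {alert.get('startsAt', '-')}")
    if screenshot_url:
        # Hypothetical public link to the uploaded screenshot.
        lines.append(f"![screenshot]({screenshot_url})")
    return {
        "msgtype": "markdown",
        "markdown": {
            "title": f"Prometheus alert: {payload.get('status', 'unknown')}",
            "text": "\n\n".join(lines),
        },
    }


if __name__ == "__main__":
    demo_payload = {
        "status": "firing",
        "alerts": [{
            "status": "firing",
            "labels": {"alertname": "HighLoad", "instance": "db01"},
            "annotations": {"summary": "load is high"},
            "startsAt": "2022-12-18T14:00:00Z",
        }],
    }
    print(build_dingtalk_message(demo_payload, "https://example.com/shot.png"))
```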

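There is still a lot to do on the alerting side, for example:

1. Critical alerts need an acknowledge button. If nobody acknowledges an alert after it has been sent N times, it is escalated (frontline -> leader -> director).

2. Silence windows for alerts (some jobs run batch workloads at night and may cause high load, so alerting continuously on them is pointless).

3. Alert merging.

4. Custom alert recipients.

5. Accepting alerts that are not pushed by Alertmanager, for example an alert triggered when a shell script fails.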
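
Items 1 and 2 are still to-dos in the post, so the following is only one possible shape for that logic, not something the demo implements. The chain names, the threshold parameter and both function names are assumptions.

```python
from datetime import time

# Hypothetical escalation chain, in the order given in item 1.
ESCALATION_CHAIN = ["frontline", "leader", "director"]


def escalation_target(unacked_notifications, threshold):
    """Pick the recipient level after some unacknowledged notifications.

    Every `threshold` unacknowledged notifications move the alert one level
    up the chain; the last level is kept once it has been reached.
    """
    level = min(unacked_notifications // threshold, len(ESCALATION_CHAIN) - 1)
    return ESCALATION_CHAIN[level]


def in_silence_window(now, start, end):
    """Return True if `now` lies in [start, end); windows may cross midnight."""
    if start <= end:
        return start <= now < end
    # A window such as 22:00 to 06:00 wraps around midnight.
    return now >= start or now < end


if __name__ == "__main__":
    print(escalation_target(unacked_notifications=4, threshold=3))
    print(in_silence_window(time(23, 30), time(22, 0), time(6, 0)))
```

The webhook handler could call `in_silence_window` before sending anything, and use `escalation_target` to choose recipients for critical alerts that remain unacknowledged.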

