Automated Operations: NIC Link Monitoring and Alerting

 1 year ago
source link: https://blog.51cto.com/u_14009921/5820644

2. Deploying Prometheus

Download the package for your platform from the official site: https://prometheus.io/download/

This walkthrough uses the 2.37.1 release:

[root@localhost monitor]# wget https://github.com/prometheus/prometheus/releases/download/v2.37.1/prometheus-2.37.1.linux-amd64.tar.gz

Extract the archive:

[root@localhost monitor]# tar zxvf prometheus-2.37.1.linux-amd64.tar.gz

Register Prometheus as a systemd service:

[root@localhost monitor]# mv /root/monitor/prometheus-2.37.1.linux-amd64/prometheus /usr/local/bin/

[root@localhost monitor]# cat <<EOF > /usr/lib/systemd/system/prometheus.service
[Unit]
Description=prometheus

[Service]
Type=simple
ExecStart=/usr/local/bin/prometheus --config.file=/root/monitor/prometheus-2.37.1.linux-amd64/prometheus.yml --web.enable-lifecycle
SuccessExitStatus=143
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target
EOF

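Set the permissions on the unit file. A unit file does not need the execute bit (systemd logs a warning when it is set), so 644 is enough. Then reload systemd so it picks up the new unit:

chmod 644 /usr/lib/systemd/system/prometheus.service
systemctl daemon-reload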

Start the service and enable it at boot:

systemctl start prometheus
systemctl enable prometheus

The Prometheus web UI is now reachable at IP:9090.
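
If no browser is at hand, roughly the same check can be done from the shell. The sketch below assumes Prometheus listens on localhost:9090 and that curl is installed; prom_health and PROM_URL are names made up for this example. It relies on the /-/healthy endpoint that Prometheus serves for health checks.

```shell
#!/usr/bin/env bash
# Base URL of the Prometheus server. localhost:9090 is an assumption; adjust as needed.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Print "healthy" if the /-/healthy endpoint answers with success,
# otherwise print "unhealthy" and return a non-zero status.
prom_health() {
    if curl -fsS --max-time 5 "${PROM_URL}/-/healthy" >/dev/null 2>&1; then
        echo "healthy"
    else
        echo "unhealthy"
        return 1
    fi
}
```

Calling prom_health after systemctl start prometheus gives a quick yes/no answer that is easy to use in scripts.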

3. Deploying Alertmanager

Download it from https://prometheus.io/download/

wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
[root@localhost monitor]# tar zxvf alertmanager-0.24.0.linux-amd64.tar.gz

Install it as a systemd service:

[root@localhost monitor]# mv alertmanager-0.24.0.linux-amd64/alertmanager /usr/local/bin

cat <<EOF > /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager

[Service]
Type=simple
ExecStart=/usr/local/bin/alertmanager --cluster.advertise-address=0.0.0.0:9093 --config.file=/root/monitor/alertmanager-0.24.0.linux-amd64/alertmanager.yml
SuccessExitStatus=143
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target
EOF
chmod 644 /usr/lib/systemd/system/alertmanager.service
systemctl daemon-reload

Start the service and enable it at boot:

systemctl start alertmanager.service
systemctl enable alertmanager.service

The Alertmanager web UI is reachable at IP:9093.

4. Deploying Grafana

[root@localhost monitor]# wget https://dl.grafana.com/enterprise/release/grafana-enterprise-9.2.3-1.x86_64.rpm
[root@localhost monitor]# yum localinstall grafana-enterprise-9.2.3-1.x86_64.rpm -y
systemctl start grafana-server
systemctl enable grafana-server

Log in to the web UI at IP:3000. The default username and password are admin/admin.

5. Deploying node_exporter on the client

wget https://github.com/prometheus/node_exporter/releases/download/v1.4.0/node_exporter-1.4.0.linux-amd64.tar.gz
tar zxvf node_exporter-1.4.0.linux-amd64.tar.gz

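You can also register it as a systemd service. For a quick run, just start it in the background from the extracted directory:

cd node_exporter-1.4.0.linux-amd64
./node_exporter &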
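
For the system-service route, a unit along the lines of the Prometheus one from section 2 should work. This is only a sketch: node_exporter_unit is a hypothetical helper, and /usr/local/bin/node_exporter assumes you have moved the binary there, which the steps in this section do not do.

```shell
#!/usr/bin/env bash
# Print a systemd unit for node_exporter to stdout.
# $1: absolute path to the node_exporter binary
#     (defaults to /usr/local/bin/node_exporter, which is an assumption).
node_exporter_unit() {
    local bin="${1:-/usr/local/bin/node_exporter}"
    cat <<EOF
[Unit]
Description=node_exporter
After=network.target

[Service]
Type=simple
ExecStart=${bin}
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target
EOF
}
```

To use it, redirect the output into /usr/lib/systemd/system/node_exporter.service, then run systemctl daemon-reload followed by systemctl enable --now node_exporter.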

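On the server, add the client as a scrape target under scrape_configs:

vim prometheus-2.37.1.linux-amd64/prometheus.yml
  - job_name: "Nic Monitor"
    static_configs:
      - targets: ["192.168.31.214:9100"]

Once Prometheus has reloaded its configuration, the new target is being scraped.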

6. Configuring Grafana

Add a data source. No InfluxDB is used here, so simply choose Prometheus.

Set up a dashboard: download the template you need from https://grafana.com/grafana/dashboards/ and import it.

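You can then customize the template to suit your needs.

7. Configuring alerts

Prometheus evaluates alert conditions written in PromQL against the collected metrics and sends the resulting alerts to Alertmanager, which takes care of routing them.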

Add the following to prometheus.yml:

rule_files:
  - "/root/monitor/rules/*.rules"

Define your rules in that directory.

Once the rules are in place they rarely need to change. When an alert reaches the Firing state, Prometheus sends it to Alertmanager.
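
Because a syntax error in a rule file can make a reload or restart fail, it may be worth validating the rules first. The sketch below assumes promtool is still in the extracted tarball directory and that Prometheus runs with --web.enable-lifecycle, as in the unit file from section 2. The names reload_if_rules_valid, PROMTOOL, RULES_GLOB and PROM_URL are chosen for this example only.

```shell
#!/usr/bin/env bash
# Paths and URL are assumptions based on the layout used in this post.
PROMTOOL="${PROMTOOL:-/root/monitor/prometheus-2.37.1.linux-amd64/promtool}"
RULES_GLOB="${RULES_GLOB:-/root/monitor/rules/*.rules}"
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Validate every rule file; reload Prometheus only if validation passes.
reload_if_rules_valid() {
    # RULES_GLOB is left unquoted on purpose so the shell expands the glob.
    if "$PROMTOOL" check rules $RULES_GLOB; then
        # The /-/reload endpoint is available only with --web.enable-lifecycle.
        curl -fsS --max-time 5 -X POST "${PROM_URL}/-/reload" && echo "reloaded"
    else
        echo "rule validation failed, not reloading" >&2
        return 1
    fi
}
```

Running reload_if_rules_valid after each rule change keeps a bad file from reaching the running server.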

Configure the NIC link check (other checks such as CPU, memory, or traffic can be added later):

groups:
  - name: Link_status
    rules:
      # Alert when a NIC has reported link down for more than 1 minute.
      - alert: LinkDown
        expr: node_network_up == 0
        for: 1m
        labels:
          severity: high
        annotations:
          summary: "the NIC {{ $labels.device }} of SERVER {{ $labels.instance }} is down"
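
Before deploying the rule it can help to see which devices the expression node_network_up == 0 would currently match, since that may include interfaces you do not care about (loopback or unused ports, for example). The sketch below filters the raw text that node_exporter serves on /metrics. down_nics is a hypothetical helper, and the exact line format is assumed to be node_network_up{device="NAME"} VALUE.

```shell
#!/usr/bin/env bash
# Read node_exporter metrics text on stdin and print the device name of
# every node_network_up sample whose value is 0.
down_nics() {
    awk '
        /^node_network_up[{]/ && $NF == 0 {
            if (match($0, /device="[^"]*"/)) {
                # Skip the 8 characters of device=" and drop the closing quote.
                print substr($0, RSTART + 8, RLENGTH - 9)
            }
        }
    '
}
```

With the client from section 5, a call would look like: curl -s http://192.168.31.214:9100/metrics | down_nics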

Configure alert delivery

When an alert fires, Prometheus sends it to Alertmanager, so point Prometheus at Alertmanager in prometheus.yml:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

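In Alertmanager, configure the route and the receivers.

Alertmanager supports grouping, inhibition, and silencing of alerts.

It can match on the labels carried by an alert to group alerts and route them to different receivers.

If you have no special requirements, a single top-level route is enough.

The Alertmanager configuration file and the email template follow.

alertmanager.yml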

global:
  resolve_timeout: 5m
  smtp_from: 'sender mailbox address'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: 'replace with your mailbox'
  smtp_auth_password: 'replace with your mailbox password'
  smtp_require_tls: false
  smtp_hello: 'qq.com'
templates:
  - '/root/monitor/alertmanager-0.24.0.linux-amd64/email.tmpl'
route:
  group_by: ['device']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 1h
  receiver: 'manager'

receivers:
  - name: 'manager'
    email_configs:
      # Placeholder: the original address was redacted by the hosting page.
      - to: 'recipient@example.com'
        headers: { Subject: " [Alert] {{ .CommonLabels.alertname }} " }
        html: '{{ template "email.to.html" . }}'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
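
A typo in alertmanager.yml is easy to make, so it may be worth validating the file before restarting the service. The sketch below assumes amtool is still in the extracted Alertmanager directory and uses its check-config subcommand. restart_if_config_valid, AMTOOL and AM_CONFIG are names invented for this example.

```shell
#!/usr/bin/env bash
# Paths are assumptions based on the layout used in this post.
AMTOOL="${AMTOOL:-/root/monitor/alertmanager-0.24.0.linux-amd64/amtool}"
AM_CONFIG="${AM_CONFIG:-/root/monitor/alertmanager-0.24.0.linux-amd64/alertmanager.yml}"

# Validate the configuration; restart Alertmanager only if it passes.
restart_if_config_valid() {
    if "$AMTOOL" check-config "$AM_CONFIG"; then
        systemctl restart alertmanager.service
    else
        echo "alertmanager.yml failed validation, not restarting" >&2
        return 1
    fi
}
```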

email.tmpl

{{/* Timestamps are shifted by 28800e9 ns (8 hours) to display UTC+8. recipient@example.com is a placeholder; the original address was redacted. */}}
{{ define "email.from" }}Administrator{{ end }}
{{ define "email.to" }}recipient@example.com{{ end }}
{{ define "email.to.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
======== ALERT FIRING ========<br>
Alert name: {{ $alert.Labels.alertname }}<br>
Severity: {{ $alert.Labels.severity }}<br>
Host: {{ $alert.Labels.instance }}<br>
NIC: {{ $alert.Labels.device }}<br>
Details: {{ $alert.Annotations.summary }}<br>
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
========== END ==========<br>
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
======== ALERT RESOLVED ========<br>
Alert name: {{ $alert.Labels.alertname }}<br>
Severity: {{ $alert.Labels.severity }}<br>
Host: {{ $alert.Labels.instance }}<br>
NIC: {{ $alert.Labels.device }}<br>
Details: {{ $alert.Annotations.summary }}<br>
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
Resolved at: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
========== END ==========<br>
{{- end }}
{{- end }}
{{- end }}
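
To exercise the route and the email template without pulling a cable, a hand-made alert can be posted to Alertmanager's v2 HTTP API. This is a sketch under a few assumptions: Alertmanager listens on localhost:9093, curl is installed, and the label values passed in are purely illustrative. test_alert_payload, send_test_alert and ALERTMANAGER_URL are names invented for this example.

```shell
#!/usr/bin/env bash
# Base URL of Alertmanager. localhost:9093 is an assumption.
ALERTMANAGER_URL="${ALERTMANAGER_URL:-http://localhost:9093}"

# Print the JSON body for a single test alert.
# $1: alertname, $2: device, $3: instance (values must not contain quotes)
test_alert_payload() {
    printf '[{"labels":{"alertname":"%s","device":"%s","instance":"%s","severity":"test"},"annotations":{"summary":"test alert, please ignore"}}]\n' \
        "$1" "$2" "$3"
}

# Post the test alert to the /api/v2/alerts endpoint.
send_test_alert() {
    test_alert_payload "$@" |
        curl -fsS --max-time 5 -X POST \
             -H 'Content-Type: application/json' \
             --data-binary @- "${ALERTMANAGER_URL}/api/v2/alerts"
}
```

For example, send_test_alert LinkDown eth0 192.168.31.214:9100 should produce an email once group_wait has elapsed. Since the alert carries no end time, it should be treated as resolved after resolve_timeout if it is not sent again.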