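I recently migrated our Prometheus monitoring into a Kubernetes cluster, following the deployment guide "Auto-discovering and Monitoring Spring Boot with Prometheus Operator on Kubernetes". Metric collection for every monitored item and the Grafana dashboards all tested fine, so the next step was to migrate and test alerting. Alertmanager has no native DingTalk support, so alerts have to go through a webhook. Fortunately, someone has already open-sourced a DingTalk alerting webhook for Prometheus (project: https://github.com/timonwong/prometheus-webhook-dingtalk), so we only need to configure it.
Creating a DingTalk robot is simple and is not covered here. Once the robot exists, the next step is to deploy the webhook, which receives alerts from Alertmanager, formats them, and forwards them to the DingTalk robot. Outside Kubernetes the deployment is equally simple: write a docker-compose file and run it.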
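The post mentions docker-compose for the non-Kubernetes case but does not show the file. As a rough stand-in, the sketch below starts the same image with a plain `docker run` instead of docker-compose, reusing the image tag, port, and flags from the Deployment in step 1. The `run_webhook_local` helper, the `DOCKER` override, and the `./config` directory (expected to hold config.yaml and template.tmpl) are assumptions made for illustration.

```shell
# Sketch: run prometheus-webhook-dingtalk outside Kubernetes.
# DOCKER can be overridden (for example DOCKER=echo) to print the command
# instead of executing it.
DOCKER="${DOCKER:-docker}"

# run_webhook_local [CONFIG_DIR]
# CONFIG_DIR (assumed default: ./config) must contain config.yaml and template.tmpl.
run_webhook_local() {
  config_dir="${1:-$PWD/config}"
  "$DOCKER" run -d --name webhook-dingtalk \
    -p 8060:8060 \
    -v "$config_dir:/config" \
    timonwong/prometheus-webhook-dingtalk:v1.4.0 \
    --web.enable-ui \
    --web.enable-lifecycle \
    --config.file=/config/config.yaml
}
```

Calling `run_webhook_local /srv/dingtalk` (a hypothetical path) would publish the webhook on port 8060 of the host.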
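1. Inside a Kubernetes cluster, pods reach each other through a Service, so start by writing a Kubernetes manifest, dingtalk-webhook.yaml, that defines both the Deployment and its Service.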
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webhook-dingtalk
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dingtalk
  replicas: 1
  template:
    metadata:
      labels:
        app: dingtalk
    spec:
      restartPolicy: Always
      containers:
      - name: dingtalk
        image: timonwong/prometheus-webhook-dingtalk:v1.4.0
        imagePullPolicy: IfNotPresent
        args:
        - '--web.enable-ui'
        - '--web.enable-lifecycle'
        - '--config.file=/config/config.yaml'
        ports:
        - containerPort: 8060
          protocol: TCP
        volumeMounts:
        - mountPath: "/config"
          name: dingtalk-volume
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: dingtalk-volume
        persistentVolumeClaim:
          claimName: dingding-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: webhook-dingtalk
  namespace: monitoring
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8060
  selector:
    app: dingtalk
  sessionAffinity: None
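1.1 The first option uses persistent storage: config.yaml and the alert template live on shared storage, so the webhook can read them no matter which node it is scheduled on. For how to persist data through NFS, see "Dynamically Provisioning NFS PVs with a StorageClass on Kubernetes".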
dingding-pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: dingding-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: "atang-nfs"
  namespace: monitoring
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 50Mi
Configuration file config.yaml:
templates:
  - /config/template.tmpl
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=replace_with_your_own_dingtalk_robot_token
Alert template template.tmpl:
{{ define "ding.link.title" }}[Monitoring Alert]{{ end }}
{{ define "ding.link.content" -}}
{{- if gt (len .Alerts.Firing) 0 -}}
{{ range $i, $alert := .Alerts.Firing }}
[Alert name]: {{ index $alert.Labels "alertname" }}
[Instance]: {{ index $alert.Labels "instance" }}
[Severity]: {{ index $alert.Labels "severity" }}
[Trigger value]: {{ index $alert.Annotations "value" }}
[Details]: {{ index $alert.Annotations "description" }}
[Fired at]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{ range $i, $alert := .Alerts.Resolved }}
[Alert name]: {{ index $alert.Labels "alertname" }}
[Instance]: {{ index $alert.Labels "instance" }}
[Status]: Recovered
[Started]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
[Resolved]: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{- end }}
{{- end }}
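Adjust the template to your own taste. The expression ".EndsAt.Add 28800e9" shifts the UTC timestamp forward by 8 hours (28800e9 nanoseconds is 28,800 seconds), because Prometheus and Alertmanager both use UTC by default. You also need to set the owner and group of both files to 65534; otherwise the webhook container has no permission to read them.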
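The two details above, the 8-hour offset and the file ownership, can be checked from a shell. This is only a sketch: `to_utc8` and `fix_owner` are hypothetical helper names, the `date -d @N` syntax assumes GNU date, and `fix_owner` assumes it is run as root in the directory on the NFS share that backs the PVC.

```shell
# 28800e9 is a Go time.Duration, which counts nanoseconds.
offset_ns=28800000000000
offset_s=$(( offset_ns / 1000000000 ))
offset_hours=$(( offset_s / 3600 ))
echo "offset: ${offset_hours}h"

# to_utc8 EPOCH_SECONDS
# Shift a Unix timestamp by the same offset and print it in the layout the
# template uses (GNU date syntax).
to_utc8() {
  date -u -d "@$(( $1 + offset_s ))" '+%Y-%m-%d %H:%M:%S'
}

# fix_owner DIR
# Hand config.yaml and template.tmpl in DIR to UID/GID 65534.
fix_owner() {
  chown 65534:65534 "$1/config.yaml" "$1/template.tmpl"
}
```

For example, `to_utc8 0` prints `1970-01-01 08:00:00`, which is the Unix epoch shifted to UTC+8.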
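1.2 The second option (recommended) mounts the configuration file and the template from a ConfigMap. This requires changing the original dingtalk-webhook.yaml so that the volume is backed by a ConfigMap instead of the PVC.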
apiVersion: v1
kind: ConfigMap
metadata:
  name: dingtalk-config
  namespace: monitoring
data:
  config.yaml: |
    templates:
      - /config/template.tmpl
    targets:
      webhook1:
        url: https://oapi.dingtalk.com/robot/send?access_token=your_dingding_token
  template.tmpl: |
    {{ define "ding.link.title" }}[Monitoring Alert]{{ end }}
    {{ define "ding.link.content" -}}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{ range $i, $alert := .Alerts.Firing }}
    [Alert name]: {{ index $alert.Labels "alertname" }}
    [Instance]: {{ index $alert.Labels "instance" }}
    [Severity]: {{ index $alert.Labels "severity" }}
    [Trigger value]: {{ index $alert.Annotations "value" }}
    [Details]: {{ index $alert.Annotations "description" }}
    [Fired at]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{ end }}{{- end }}
    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{ range $i, $alert := .Alerts.Resolved }}
    [Alert name]: {{ index $alert.Labels "alertname" }}
    [Instance]: {{ index $alert.Labels "instance" }}
    [Status]: Recovered
    [Started]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    [Resolved]: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{ end }}{{- end }}
    {{- end }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webhook-dingtalk
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dingtalk
  replicas: 1
  template:
    metadata:
      labels:
        app: dingtalk
    spec:
      restartPolicy: Always
      containers:
      - name: dingtalk
        image: timonwong/prometheus-webhook-dingtalk:v1.4.0
        imagePullPolicy: IfNotPresent
        args:
        - '--web.enable-ui'
        - '--web.enable-lifecycle'
        - '--config.file=/config/config.yaml'
        ports:
        - containerPort: 8060
          protocol: TCP
        volumeMounts:
        - name: dingtalk-config
          mountPath: "/config"
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: dingtalk-config
        configMap:
          name: dingtalk-config
---
apiVersion: v1
kind: Service
metadata:
  # Must match the host in the Alertmanager webhook URL
  # (http://webhook-dingtalk/dingtalk/webhook1/send).
  name: webhook-dingtalk
  namespace: monitoring
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8060
  selector:
    app: dingtalk
  sessionAffinity: None
2. Edit Alertmanager's default configuration to add webhook_configs. Replace the contents of kube-prometheus-master/manifests/alertmanager-secret.yaml with the following:
apiVersion: v1
data: {}
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
    "inhibit_rules":
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "critical"
      "target_match_re":
        "severity": "warning|info"
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "warning"
      "target_match_re":
        "severity": "info"
    "receivers":
    - "name": "m.maowutv.com"
    #- "name": "Watchdog"
    #- "name": "Critical"
    #- "name": "webhook"
      "webhook_configs":
      - "url": "http://webhook-dingtalk/dingtalk/webhook1/send"
        "send_resolved": true
    "route":
      "group_by":
      - "namespace"
      "group_interval": "5m"
      "group_wait": "30s"
      "receiver": "m.maowutv.com"
      "repeat_interval": "12h"
      #"routes":
      #- "match":
      #    "alertname": "Watchdog"
      #  "receiver": "Watchdog"
      #- "match":
      #    "severity": "critical"
      #  "receiver": "Critical"
Once all the YAML files are ready, apply them (dingding-pvc.yaml is only needed for option 1.1):
kubectl apply -f dingding-pvc.yaml
kubectl apply -f dingtalk-webhook.yaml
kubectl apply -f alertmanager-secret.yaml
Then check the result.
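One way to check the result from the command line is sketched below. The `check_webhook` helper and the `KUBECTL` override are assumptions for illustration; the namespace and the `app=dingtalk` label come from the manifests above.

```shell
# KUBECTL can be overridden (for example KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"
NS="monitoring"

# check_webhook
# List the objects created by the apply commands, then show the webhook pod
# and the tail of its log.
check_webhook() {
  "$KUBECTL" get deployments,services,persistentvolumeclaims -n "$NS"
  "$KUBECTL" get pods -n "$NS" -l app=dingtalk -o wide
  "$KUBECTL" logs -n "$NS" -l app=dingtalk --tail=20
}
```

Running `check_webhook` should show the webhook pod in the Running state; if it is not, the log lines are usually the first place to look (for instance, for a permission error on config.yaml).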

Then open the Alertmanager status page (replace alertmanager.amd5.cn with your own address) and confirm that the webhook_configs entry has taken effect: http://alertmanager.amd5.cn/#/status.
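To exercise the webhook and the DingTalk robot without waiting for a real alert, one option is to post a hand-built payload in the shape Alertmanager sends to webhook receivers. This is a sketch under assumptions: `test_payload` and `send_test_alert` are hypothetical helpers, every label and annotation value is made up, and the default URL only resolves from a pod inside the monitoring namespace.

```shell
# CURL can be overridden for a dry run.
CURL="${CURL:-curl}"
# URL taken from the webhook_configs entry above.
WEBHOOK_URL="${WEBHOOK_URL:-http://webhook-dingtalk/dingtalk/webhook1/send}"

# test_payload
# Print a minimal Alertmanager-style webhook payload with one firing alert.
# The labels and annotations match the fields the template reads.
test_payload() {
  cat <<'EOF'
{
  "version": "4",
  "status": "firing",
  "receiver": "m.maowutv.com",
  "groupLabels": {"namespace": "monitoring"},
  "commonLabels": {"alertname": "ManualTest"},
  "commonAnnotations": {},
  "externalURL": "",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "ManualTest", "instance": "test-host", "severity": "warning"},
      "annotations": {"value": "1", "description": "manual webhook test"},
      "startsAt": "2021-01-01T00:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ]
}
EOF
}

# send_test_alert
# POST the payload to the webhook.
send_test_alert() {
  test_payload | "$CURL" -sS -X POST \
    -H 'Content-Type: application/json' \
    --data-binary @- "$WEBHOOK_URL"
}
```

From outside the cluster, something like `kubectl -n monitoring port-forward svc/webhook-dingtalk 8060:80` combined with `WEBHOOK_URL=http://localhost:8060/dingtalk/webhook1/send` should reach the same endpoint.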
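3. Once the configuration is in effect, add an alert rule and wait for its threshold to be crossed so that a test alert fires.
Edit kube-prometheus-master/manifests/prometheus-rules.yaml and append the content below at the end of the file, minding the indentation.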
  - name: prometheus-operator
    rules:
    - alert: PrometheusOperatorReconcileErrors
      annotations:
        message: Errors while reconciling {{ $labels.controller }} in {{ $labels.namespace }} Namespace.
      expr: |
        rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
    - alert: PrometheusOperatorNodeLookupErrors
      annotations:
        message: Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace.
      expr: |
        rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
  # The test alert rule added for this article starts here
  - name: m.maowutv.com
    rules:
    - alert: 'DingTalk alert test'
      expr: |
        jvm_threads_live > 140
      for: 1m
      labels:
        severity: 'warning'
      annotations:
        summary: "{{ $labels.instance }}: DingTalk alert test"
        description: "{{ $labels.instance }}: DingTalk alert test"
        custom: "DingTalk alert test"
        value: "{{$value}}"
Then apply the updated rules:
kubectl apply -f prometheus-rules.yaml
Then open the Prometheus alerts page at http://prometheus.amd5.cn/alerts to check that the rule has been loaded.
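The same check can also be made through the Prometheus HTTP API rather than the web page. A minimal sketch, assuming the Prometheus address from this article is reachable from your shell; `prom_get`, `list_rules`, `list_alerts`, and the `CURL` and `PROM_URL` overrides are hypothetical names.

```shell
# CURL and PROM_URL can be overridden; the default address is the one used above.
CURL="${CURL:-curl}"
PROM_URL="${PROM_URL:-http://prometheus.amd5.cn}"

# prom_get PATH
# Fetch one endpoint of the Prometheus HTTP API and print the JSON response.
prom_get() {
  "$CURL" --silent --show-error "$PROM_URL$1"
}

# Loaded rule groups; the m.maowutv.com group should be listed once the
# updated rules have been picked up.
list_rules() { prom_get /api/v1/rules; }

# Alerts that are currently pending or firing.
list_alerts() { prom_get /api/v1/alerts; }
```

While the test alert is inside its `for: 1m` window, it should appear in the `list_alerts` output with state pending, and with state firing afterwards.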

Once the condition has persisted for the duration set in the rule, DingTalk receives the alert.



