
Analyzing Prometheus alerting behavior

Experience  dawei  May 11, 2021

Problem analysis

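While operating Prometheus recently, I noticed three kinds of misbehavior: sometimes an alert that should have been sent was not, sometimes an alert was sent when it should not have been, and sometimes alerts arrived with an obvious delay. To track down the causes I went through a number of resources, including the official documentation. I hope the results help others who run Prometheus.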

Let's first look at some important default settings for Prometheus and Alertmanager, taken from the official documentation:

# Prometheus
global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]
  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]
  # How frequently to evaluate rules (this includes alerting rules).
  [ evaluation_interval: <duration> | default = 1m ]

# Alertmanager
# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> | default = 30s ]  # wait before the first notification for a group
# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]  # interval for sending newly fired alerts in the same group
# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> | default = 4h ]  # interval for re-sending the same alert

With these settings in mind, let's walk through the whole alerting flow and use it to locate the problems.
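
As a rough illustration of how the intervals listed above can add up, here is a minimal Python sketch that estimates an upper bound on the time between a condition becoming true on a target and the first notification being sent. The names `AlertTimings` and `worst_case_first_notification` are hypothetical and exist only for this example; they are not part of Prometheus or Alertmanager. The model is deliberately simplified: it assumes the condition becomes true right after a scrape, that the alerting rule has no `for` clause, and that scrape duration, network latency, and processing time are negligible.

```python
from dataclasses import dataclass


@dataclass
class AlertTimings:
    """Intervals in seconds; defaults mirror the documented defaults above."""

    scrape_interval: float = 60.0      # Prometheus global.scrape_interval (1m)
    evaluation_interval: float = 60.0  # Prometheus global.evaluation_interval (1m)
    group_wait: float = 30.0           # Alertmanager group_wait (30s)


def worst_case_first_notification(t: AlertTimings) -> float:
    """Rough upper bound, in seconds, on the delay before the first notification.

    Assumed (simplified) flow:
      1. The condition turns true just after a scrape, so it is only
         observed at the next scrape: up to one scrape_interval.
      2. The new sample is only seen by the alerting rule at the next
         rule evaluation: up to one evaluation_interval.
      3. Alertmanager holds a new alert group for group_wait before
         sending the first notification.
    """
    return t.scrape_interval + t.evaluation_interval + t.group_wait


if __name__ == "__main__":
    # With the default settings quoted above.
    print(worst_case_first_notification(AlertTimings()))  # 150.0
```

Under these assumptions the defaults alone allow a gap of about two and a half minutes, so a noticeable delay is not necessarily a malfunction. Shortening `scrape_interval` or `evaluation_interval` reduces the bound at the cost of more load on Prometheus and its targets.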


