+
+# 17 minutes: We try to send every 5 minutes. If a reboot causes one
+# send to fail, that's 10 minutes between two successful sends. We test
+# this every 5 minutes, so that's 15 minutes of delay we can expect from
+# one failed email, and one failed email is expected due to reboots or
+# other tiny issues we don't care about.
+#
+# cmc_wan_down etc. inhibit other alerts, but mailtest_check needs
+# additional time to recover after an outage. We can only inhibit while
+# an alert is actually firing; inhibition doesn't affect the "for:"
+# condition. So, we make the alerts that need to be delayed be
+# conditioned on a query for that alert not having fired in the last X
+# minutes. However, there is a special case: when prometheus itself was
+# down, there was no alert at all. So, I also test for the absence of a
+# metric that prometheus generates about itself. If for some reason that
+# has a problem, I could make it more conservative by checking that we
+# booted recently instead, eg:
+# time() - node_boot_time_seconds{instance="kdwg:9101"} <= 60 * 17
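+#
+# For reference, a sketch of what the mailtest_lag_inhibit recording
+# rule could look like, per the logic above (assumption: the alertname,
+# metric name, and 17m window here are illustrative, not the real rule):
+# - record: mailtest_lag_inhibit
+#   expr: |-
+#     present_over_time(ALERTS{alertname="cmc_wan_down", alertstate="firing"}[17m])
+#       or absent_over_time(prometheus_build_info[17m])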
+ - alert: mailtest_check_vps
+ expr: |-
+ time() - mailtest_check_last_usec{job="tlsnode"} >= 60 * 17 unless on() mailtest_lag_inhibit
+ labels:
+ severity: day