TechEarl

55 Monitoring and Alert Fatigue Jokes Only SREs Will Get

Monitoring and alert fatigue jokes on flapping pages, silenced alerts that fire anyway, empty runbooks, PromQL dreams, and 3am self-recovery notes.

Ishan Karunaratne · 3 min read

55 Monitoring and Alert Fatigue Jokes

My phone buzzed. I flinched before I read it. It was a calendar reminder.

"Did you see the alert?" "Which one?"

The dashboard has 47 panels. I watch two.

Severity levels at my company: P1, P2, P3, and "this one is real."

The runbook says: "If this fires, see the owner." The owner left in 2021.

Alert: CPU at 92%. Me: that's a Tuesday.

I muted the channel. The channel was the on-call channel.

"Why didn't anyone respond?" Because the same alert has fired hourly since March.

The alert was silenced. It fired anyway. I respect its commitment.

Disk full at 3:14 a.m. Disk fine at 3:17 a.m. Nobody acknowledged. This happens every night. Nobody investigates.

"It's just a warning." The warning has been firing for six months.

I added a new monitor. I deleted three to stay sane.

PagerDuty's most useful feature is the snooze button.

The alert title: "Something is wrong." Great. Thanks.

We have alerts for the alerts.

"The monitor is broken." "How do you know?" "It hasn't paged in a week."

I was paged for a service I have never heard of.

The dashboard is green. The users are calling. One of these is lying.

On-call onboarding: "Here is the runbook." "It's empty." "Good luck."

"Did you check the graph?" "Which one?" "Any of them."

I have four monitoring tools. Each disagrees with the other three.

The alert fired because the metric reporter died. The service was fine.

"This is a noisy alert." The ticket has been open for two years.

I dreamed in PromQL again.

"It self-recovered." The most common resolution note in the company.

Every dashboard has a panel labeled "misc" that secretly runs the business.

Anomaly detection flagged Sunday as an anomaly. It was Sunday.

The on-call handoff is one sentence: "Good luck with the queue."

I built a status page. It has its own status page.

The vendor said: "You'll never miss an alert again." They were right. I miss none of them. I also sleep none.

"What does this alert mean?" "Nobody knows. It came with the platform."

The threshold was set in 2019. The service grew tenfold. The threshold did not.

I clicked acknowledge. It fired again before the modal closed.

The escalation policy ends at "call CEO." Nobody has ever reached step five. Nobody wants to find out what happens.

"Why is this alert P1?" "Someone got paged on a weekend once and made it P1."

Three monitors. One service. One page each. Four minutes of phone vibrating.

I trust the customer ticket more than the dashboard.

"The SLO is at 99.9%." The users are at 100% angry.

The healthiest service in the company is the one nobody monitors. For now.

Synthetic monitor fails. Users fine. It's the synthetic. It's always the synthetic.

The alert message is "see playbook." The playbook says "see alert."

"Can we make this less noisy?" Three months later: same noise, new dashboard.

I built alert fatigue dashboards to track our alert fatigue.

We measured time-to-acknowledge. It got worse the more we measured.

The on-call sleep schedule: Phone face up. Brightness max. Volume max. Dignity zero.

"Is this a real one?" The only question that matters at 3 a.m.

The alert title was in Latin. I Googled it. It was a Kubernetes default.

We have 14 different definitions of "healthy."

The dashboard was beautiful in the demo. It has not been opened since.

"Are we under SLO?" "Define SLO."

Postmortem action item: "Improve monitoring." Number of postmortems with that exact line: all of them.

I unsubscribed from one alert channel. My Slack got 3% quieter. I was paged for what I missed.

"It's flapping." The word that excuses six months of inaction.

An alert fired with the description: "This shouldn't happen." It happens daily.

My favorite alert resolution note is one word. "Yes."

Why monitoring jokes write themselves

The original promise of monitoring was visibility. The lived experience is noise. Every team I have worked on starts by adding a few sensible alerts, then adds a few more after every incident, then never removes any, and ends up with a paging policy that runs on superstition. The dashboard grows panels the way a hoarder's garage grows boxes. Nobody is willing to delete the alert that fired once in 2020 and saved the company, even though it has fired three thousand times since for unrelated reasons.

The Google SRE book has a whole chapter on this, and the gist of it is that your pager should only ever ring for something a human needs to fix right now. In practice, most teams treat the pager as a notification feed. The result is the on-call engineer who acknowledges nine pages an hour, reads none of them, and misses the tenth, which was the real one. Alert fatigue is not laziness. It is the predictable outcome of a system where every signal has the same volume.
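To make the "only page for what a human must fix right now" rule concrete, here is a minimal sketch of giving signals different volumes. It is an illustration, not anything taken from the SRE book or a real paging tool: the `Alert` class, its `actionable` and `urgent` flags, and the `route` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert record; the two flags are assumptions for this sketch."""
    name: str
    actionable: bool  # a human can actually do something about it
    urgent: bool      # it cannot wait until working hours


def route(alert: Alert) -> str:
    """Return 'page', 'ticket', or 'log' so that not every signal is equally loud."""
    if alert.actionable and alert.urgent:
        return "page"    # wake someone up
    if alert.actionable:
        return "ticket"  # fix it during the day
    return "log"         # keep it for debugging, notify nobody


alerts = [
    Alert("checkout error rate above SLO", actionable=True, urgent=True),
    Alert("disk 80% full on batch host", actionable=True, urgent=False),
    Alert("CPU at 92%", actionable=False, urgent=False),
]

for alert in alerts:
    print(f"{route(alert):6} {alert.name}")
```

The hard part is not the three-line function. It is getting a team to agree which alerts deserve `urgent=True`.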

The dark joke under all of this is that the fix is unglamorous. You delete alerts. You raise thresholds. You write the runbook the bored future-you will actually follow at 3 a.m. None of that ships features. None of it gets praise in the all-hands. So it does not happen, and the page fires again, and somebody else writes a postmortem with the line "improve monitoring" in the action items, and the cycle continues.
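The "delete alerts" step can start from page history. The sketch below assumes you can export a list of (alert name, whether anyone took action) pairs from your paging tool. That export format, the sample data, the `deletion_candidates` function, and the `min_fires` cutoff are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical page history: (alert name, did a human take any action?)
history = [
    ("disk full 3:14", False),
    ("disk full 3:14", False),
    ("disk full 3:14", False),
    ("checkout 5xx", True),
    ("checkout 5xx", False),
    ("checkout 5xx", True),
    ("cpu at 92%", False),
]


def deletion_candidates(history, min_fires=3):
    """List alerts that fired at least min_fires times and were never acted on."""
    fires = Counter()
    acted = Counter()
    for name, action_taken in history:
        fires[name] += 1
        if action_taken:
            acted[name] += 1
    return sorted(
        name for name, count in fires.items()
        if count >= min_fires and acted[name] == 0
    )


print(deletion_candidates(history))
```

With the sample data, only the nightly disk alert qualifies: it fired three times and nobody ever did anything about it. An alert that fired once is left alone until there is enough history to judge it.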


Tags: Humor, Jokes, Monitoring, Alerting, SRE, On-call, Observability


Ishan Karunaratne

Tech Architect · Software Engineer · AI/DevOps

Tech architect and software engineer with 20+ years building software, Linux systems, and DevOps infrastructure, and lately working AI into the stack. Currently Chief Technology Officer at a healthcare tech startup, which is where most of these field notes come from.
