There's a conversation that happens in engineering leadership circles that rarely makes it into public postmortems. It goes something like this: "We lost two of our best people last quarter. Both cited on-call."
Hiring managers nod, then move on. The incident queue fills back up. The rotation shrinks. The surviving engineers absorb more shifts. Repeat.
What's striking isn't that this happens — it's how little most organizations have quantified it. On-call burnout is treated as an inevitable cost of running production systems, a soft cultural problem that HR might address with wellness stipends or shift swaps. In reality, it's a hard financial problem hiding in plain sight.
The Anatomy of Alert Fatigue
Let's start with what engineers actually experience. The average on-call engineer receives more than 30 alerts per shift. Of those, industry data suggests up to 67% require no action — they're noise. False positives. Monitoring thresholds set too aggressively two years ago that no one has had time to tune.
This isn't a minor inconvenience. It's a systematic erosion of trust. When an engineer's phone buzzes at 2am, they no longer believe it's critical. They delay. They silence. And then the one alert that actually matters gets the same treatment.
"The worst outcome of alert fatigue isn't burnout. It's the engineer who stops taking pages seriously — right before the one that matters."
Beyond the noise, there's the diagnosis problem. Mean Time to Detect (MTTD) regularly exceeds Mean Time to Repair (MTTR) — meaning engineers spend more time figuring out what is broken than actually fixing it. The culprits are predictable: siloed observability tools, tribal knowledge locked in the heads of your most senior (and most burned-out) engineers, and no single pane of glass across database, OS, and application layers.
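As a rough illustration of how that detect-versus-repair split can be measured, the sketch below (Python) computes both averages from a handful of incident records. The field names, timestamp convention, and sample values are assumptions for the example; in practice the data would come from whatever your incident-management tool exports.

```python
from datetime import datetime
from statistics import mean

FMT = "%Y-%m-%d %H:%M"

# Hypothetical incident records; field names are assumptions, not any real tool's schema.
incidents = [
    {"occurred_at": "2024-03-04 02:10", "detected_at": "2024-03-04 02:55", "resolved_at": "2024-03-04 03:20"},
    {"occurred_at": "2024-03-11 01:40", "detected_at": "2024-03-11 03:05", "resolved_at": "2024-03-11 03:30"},
]

def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60

# One common convention: MTTD = occurrence to detection, MTTR = detection to resolution.
mttd = mean(minutes_between(i["occurred_at"], i["detected_at"]) for i in incidents)
mttr = mean(minutes_between(i["detected_at"], i["resolved_at"]) for i in incidents)

print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")
if mttd > mttr:
    print("More time is spent finding the problem than fixing it.")
```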
The Recurring Incident Trap
Here's the pattern that defines on-call dysfunction at scale: the same incident fires on Monday, gets a band-aid fix at 1am, closes, and fires again the following Monday.
Root cause analysis gets scheduled. It gets bumped. Someone writes a Jira ticket. The ticket ages. The incident fires again.
- Alert fires — engineer is paged, interrupts sleep or deep work
- Triage consumes 30–90 minutes navigating disconnected tools
- Workaround applied — the fix that isn't a fix
- Incident closed — the ticket that never gets prioritized
- Cycle repeats, often weekly, often indefinitely
This loop has a compounding effect on your team. Each repetition confirms the implicit message that the organization doesn't value their time. Each repetition makes the next engineer exit a little more likely.
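One practical way to surface this loop is to group pages by alert fingerprint and flag anything that keeps firing. The sketch below is illustrative only: the fingerprints, threshold, and in-memory page log are assumptions, and real data would come from your paging tool's export or API.

```python
from collections import Counter

# Hypothetical page log: one entry per page, keyed by an alert fingerprint.
pages = [
    {"fingerprint": "db-conn-pool-exhausted", "date": "2024-03-04"},
    {"fingerprint": "db-conn-pool-exhausted", "date": "2024-03-11"},
    {"fingerprint": "db-conn-pool-exhausted", "date": "2024-03-18"},
    {"fingerprint": "disk-usage-warning", "date": "2024-03-07"},
]

# An alert that fires this many times in the window is a candidate for
# root cause analysis or automated remediation rather than another 1am workaround.
RECURRENCE_THRESHOLD = 3

counts = Counter(p["fingerprint"] for p in pages)
recurring = [(fp, n) for fp, n in counts.most_common() if n >= RECURRENCE_THRESHOLD]

for fp, n in recurring:
    print(f"{fp}: fired {n} times; prioritize a permanent fix")
```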
What This Actually Costs
The financial reality of on-call dysfunction is rarely presented in a single number. Here's what it looks like for a mid-size technology company running 10 on-call engineers:
Direct downtime (10 P1 incidents/year): $500K – $2M
Engineer labor during incidents: $200K – $500K
On-call productivity tax (20% reduced output): $400K
Alert fatigue / false positive triage: $100K – $150K
Recurring incidents never root-caused: $500K – $1M
Turnover from burnout (1–2 engineers/year): $260K – $620K
Total Annual On-Call Cost: $2M – $4.7M
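For readers who want to check the arithmetic, here is a minimal sketch that sums the low and high bounds of each line item above. The numbers are copied straight from the table; the labels are only illustrative.

```python
# (low, high) annual cost estimates in USD, taken from the table above.
LINE_ITEMS = {
    "direct downtime (10 P1 incidents/year)": (500_000, 2_000_000),
    "engineer labor during incidents": (200_000, 500_000),
    "on-call productivity tax (20% reduced output)": (400_000, 400_000),
    "alert fatigue / false positive triage": (100_000, 150_000),
    "recurring incidents never root-caused": (500_000, 1_000_000),
    "turnover from burnout (1-2 engineers/year)": (260_000, 620_000),
}

low = sum(lo for lo, _ in LINE_ITEMS.values())
high = sum(hi for _, hi in LINE_ITEMS.values())

print(f"Total annual on-call cost: ${low / 1e6:.1f}M to ${high / 1e6:.1f}M")
# -> Total annual on-call cost: $2.0M to $4.7M
```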
Two items in this table deserve particular attention: the productivity tax and the turnover cost.
The productivity tax is invisible in most engineering metrics. On-call engineers aren't just losing sleep during incidents — they're operating at roughly 80% capacity the day after a page, the week of a bad rotation, the month following a major outage. That 20% reduction in output compounds across your entire on-call rotation, quarter after quarter.
The turnover cost is the one that leadership tends to undercount. At a fully-loaded replacement cost of $310,000 per engineer (recruiting, onboarding, ramp time, lost institutional knowledge), losing even one engineer per year to on-call burnout approaches a million dollars over a three-year horizon.
The Uncomfortable Truth
Organizations that treat on-call burnout as a culture problem will keep writing wellness checks. Organizations that treat it as a financial problem will fix the underlying system. The difference shows up in your retention numbers within two quarters.
The Human Cost Behind the Numbers
Numbers are clean. The reality isn't.
On-call burnout doesn't announce itself with a resignation letter. It builds slowly: the engineer who stops raising concerns in postmortems because nothing changes anyway. The senior engineer who opts out of the promotion track because the scope would mean more on-call exposure. The quiet performer who starts interviewing not because they're chasing salary, but because they just want to sleep.
Forty-two percent of engineers who leave cite on-call burden as a primary driver. That's not a fringe statistic — it represents nearly half of your attrition risk sitting directly inside a solvable systems problem.
The engineers who stay aren't unaffected. Shrinking rotations mean higher per-person load. Higher load accelerates the same burnout cycle in whoever remains. It's a self-reinforcing drain on your most operationally critical people.
What Leaders Get Wrong
The instinct when confronted with on-call burnout is to add headcount to the rotation, reduce SLA commitments, or invest in better runbooks. These aren't bad ideas — but they treat the symptom.
The root cause is that humans are being asked to respond to problems that either shouldn't be alerting at all, or that the system already knows how to fix. The cost isn't the 3am wake-up — it's the structure that makes that wake-up necessary.
Organizations spending $2–5M annually on this problem aren't doing so because they lack talent or discipline. They're doing so because the tooling, processes, and observability infrastructure haven't caught up with the scale and complexity of the systems their engineers are asked to maintain.
The Starting Point
If you're not sure where you stand, start with a simple audit: pull the last 30 days of incident data. For each page, ask whether a human decision was actually required, or whether a known remediation script could have handled it. Most organizations find the answer is humbling — and clarifying.
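Here is one way that audit could be scripted, assuming your paging tool can export the last 30 days of pages to a CSV with a column recording how each page was resolved. The file name, column name, and resolution categories are assumptions; substitute whatever your tooling actually records.

```python
import csv
from collections import Counter

# Resolutions a remediation script could plausibly handle, and resolutions that
# indicate the page was noise. Both sets are assumptions for this sketch.
AUTOMATABLE = {"restarted service", "cleared disk space", "failed over", "ran runbook"}
NOISE = {"no action", "auto-resolved", "false positive"}

tally = Counter()
with open("pages_last_30_days.csv", newline="") as f:
    for row in csv.DictReader(f):
        resolution = row["resolution"].strip().lower()
        if resolution in NOISE:
            tally["noise"] += 1
        elif resolution in AUTOMATABLE:
            tally["automatable"] += 1
        else:
            tally["human judgment required"] += 1

total = sum(tally.values())
for category, count in tally.most_common():
    print(f"{category}: {count} pages ({count / total:.0%})")
```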
The goal isn't to eliminate on-call. Production systems require human oversight. The goal is to ensure that when an engineer is paged, it's because a human judgment call is genuinely needed — not because no one got around to automating the fix for an incident that's fired 47 times this year.
Your engineers already know which incidents those are. Ask them.
Contact Us
The first step toward a healthier on-call culture is a single conversation. Reach out to our team today and let's talk about what's possible for your organization. Contact the Arisant experts at 303-330-4065 or email us at polaris_monitoring@arisant.com.
Polaris is a monitoring tool with self-healing capabilities that can augment or replace your current monitoring solution. Please reach out if you'd like additional information.
This analysis draws on industry research from Gartner and PagerDuty, along with first-party incident data. Cost estimates reflect conservative projections for a mid-size technology organization operating with 10 on-call engineers.