The Documentation System That Prevented an Outage

The Setup

When Priya joined Meridian Health’s infrastructure team as a Staff Engineer, she inherited 47 services with exactly zero runbooks. The on-call rotation was a nightmare—each incident became a war room where senior engineers were paged regardless of the hour because only they understood how systems connected.

“We were hemorrhaging institutional knowledge,” Priya recalled. “Every time someone left, we lost critical system understanding. And worse, our MTTR was 4 hours because even simple issues required archeology.”

The Insight

Most engineers would have started writing runbooks. Priya didn’t. Instead, she spent two weeks shadowing on-call shifts and analyzing incident reports. Her finding was counterintuitive: the problem wasn’t missing documentation—it was that documentation existed but couldn’t be found or trusted.

“We had Confluence pages from 2019. Google Docs that contradicted each other. READMEs that described systems that no longer existed. The team had learned to ignore documentation because it was usually wrong.”

The real problem was documentation decay. Without a system to keep docs current, they became dangerous—worse than no documentation at all.

The Solution

Priya proposed what she called “Living Documentation”—a controversial approach that tied documentation directly to the systems it described.

Core principles:

Docs as code: All documentation lived in the same repo as the service and was reviewed in the same PR as the code it described
Automated verification: CI pipelines tested that documented endpoints existed and commands worked
Ownership enforcement: Every doc file had a CODEOWNERS entry; stale docs blocked deployments
Freshness signals: Docs displayed last-verified dates; anything over 90 days triggered alerts
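The article does not show Meridian's actual checks, so the following is only a minimal sketch of how the automated-verification and freshness principles might look in a CI step. The `last-verified:` header line, the "GET /path" endpoint notation, and the function names are all assumptions made for illustration; only the 90-day threshold comes from the text above.

```python
import re
from datetime import date, timedelta

# Hypothetical convention: each doc carries a line such as
# "last-verified: 2025-01-15".
LAST_VERIFIED = re.compile(r"^last-verified:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE)
# Hypothetical convention: endpoints appear in the doc as "GET /path".
ENDPOINT = re.compile(r"\b(GET|POST|PUT|PATCH|DELETE)\s+(/[^\s`)]*)")

MAX_AGE = timedelta(days=90)  # threshold mentioned in the article


def is_stale(doc_text: str, today: date) -> bool:
    """Return True if the doc has no last-verified date or it is over 90 days old."""
    match = LAST_VERIFIED.search(doc_text)
    if match is None:
        return True  # a doc that was never verified is treated as stale
    verified = date.fromisoformat(match.group(1))
    return today - verified > MAX_AGE


def missing_endpoints(doc_text: str, live_routes: set[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return documented (method, path) pairs that the service no longer exposes."""
    documented = {
        (m.group(1), m.group(2).rstrip(".,;"))  # drop trailing sentence punctuation
        for m in ENDPOINT.finditer(doc_text)
    }
    return sorted(documented - live_routes)
```

In a pipeline, a non-empty result from `missing_endpoints` or a `True` from `is_stale` would fail the build. The set of live routes would have to come from the service itself, for example its route table or API spec, which this sketch leaves to the caller.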

“The pushback was immediate,” Priya said. “Engineers argued that documentation shouldn’t block deploys. But that’s exactly backward—if your docs are so untrustworthy that you won’t let them block deploys, why have them at all?”

The Implementation

Priya didn’t mandate the new system. Instead, she picked her battle carefully: the payments service, which had the worst incident history.

Phase 1: Prove the concept (Weeks 1-4)

She wrote the first runbook herself, including executable verification tests. When a payment incident occurred at 2 AM, the on-call engineer resolved it in 20 minutes using the runbook—without escalation.

“That incident bought me credibility. The on-call engineer told the story at standup. Suddenly, other teams wanted what payments had.”

Phase 2: Build the tooling (Weeks 5-8)

Rather than asking teams to write docs, Priya built scaffolding:

Templates that autogenerated 60% of runbook content from service manifests
A CLI that extracted common commands from shell history
Pre-commit hooks that caught broken internal links
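As an illustration of the last item, a broken-link check can be quite small. This is a sketch under stated assumptions, not Priya's tooling: it assumes the docs are Markdown, that internal links are relative file paths, and that the hook runner passes the changed file names as arguments. The names `broken_links` and `main` are hypothetical.

```python
import re
from pathlib import Path

# Matches Markdown links such as [text](target); assumes docs are Markdown.
LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_links(doc_path: Path) -> list[str]:
    """Return relative link targets in doc_path that do not exist on disk."""
    broken = []
    for target in LINK.findall(doc_path.read_text(encoding="utf-8")):
        if target.startswith(("http://", "https://", "mailto:", "#")):
            continue  # external links and in-page anchors are out of scope here
        file_part = target.split("#", 1)[0]  # drop any "#section" suffix
        if not (doc_path.parent / file_part).exists():
            broken.append(target)
    return broken


def main(argv: list[str]) -> int:
    """Check each named file; a non-zero return value is meant to block the commit."""
    failed = False
    for name in argv:
        for target in broken_links(Path(name)):
            print(f"{name}: broken link -> {target}")
            failed = True
    return 1 if failed else 0
```

A hook entry point would call `sys.exit(main(sys.argv[1:]))` so that any broken link produces a failing exit code. Checking only relative paths keeps the hook fast and offline; validating external URLs would need network access and is better suited to a scheduled job.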

Phase 3: Create incentives (Weeks 9-12)

She worked with SRE leadership to add documentation scores to service maturity metrics. Teams with complete, verified docs got priority for new infrastructure features.

“I never asked anyone to write documentation. I made it easier to write than not write, and more valuable to have than not have.”

The Results

Six months later:

MTTR dropped from 4 hours to 45 minutes
Escalations fell by 70%
New engineer onboarding time was halved
The share of weekend pages requiring senior engineer involvement dropped from 80% to 15%

But the number that mattered most to Priya: zero incidents caused by outdated documentation.

The Leadership Lessons

1. Solve the system, not the symptom

“If I’d just written runbooks, they’d be outdated in six months. The system had to make current documentation the path of least resistance.”

2. Let results speak before mandates

“I could have pushed for a company-wide policy. But one successful incident resolution did more for adoption than any executive mandate.”

3. Remove friction before adding requirements

“Engineers don’t resist documentation—they resist documentation that’s harder to maintain than the code. Fix the tooling first.”

4. Find the force multiplier

“Writing 47 runbooks myself would have taken months and still left the decay problem. Building the system took 12 weeks and solved it permanently.”

5. Connect to pain points

“Nobody cares about documentation quality in the abstract. They care about 2 AM pages and 4-hour incident calls. Frame your solution in terms of pain they already feel.”

The Broader Pattern

Priya’s approach exemplifies a core staff engineer skill: systemic thinking. She didn’t solve one team’s documentation problem; she created a self-sustaining system that kept documentation current across the organization.

“As staff engineers, our job isn’t to do the work. It’s to build systems that make the work easier for everyone. If you’re the only one who can maintain your solution, you’ve failed.”

Today, Priya is a Principal Engineer leading Meridian’s platform organization. Her documentation system has been open-sourced and adopted by three other companies. More importantly, on-call at Meridian is no longer something engineers dread—it’s a normal part of the rotation that rarely disrupts sleep.

“The best infrastructure is invisible,” Priya said. “The best staff engineer work is too. If you did it right, people don’t remember it as your project—they just remember that things got better.”

2025-11-24
