The Incident Commander Who Never Ran an Incident
Sarah had been at the company for six years, rising from senior engineer to staff engineer. She’d built critical systems, mentored dozens of engineers, and made architecture decisions that saved millions. But she’d never been incident commander during a major outage.
Until the day the payment system went down during Black Friday weekend.
The Call
2:47 AM on the biggest shopping day of the year. Sarah’s phone erupted. The on-call rotation had already paged three engineers. Nobody could figure out what was wrong. Transactions were failing at a 40% rate. The company was losing $50,000 per minute.
The VP of Engineering sent a direct message: “Sarah, we need you on this.”
Sarah felt her stomach drop. She understood the payment system deeply—she’d designed parts of it—but she’d never commanded an incident. That was always someone else. The senior managers. The platform team leads. Not her.
But everyone was already on the call. Twelve engineers talking over each other. Nobody coordinating. Panic spreading.
Sarah unmuted: “Okay everyone, this is now a severity-1 incident. I’m incident commander. Please mute unless you have confirmed information or a specific question.”
What She Learned About Leadership
1. Authority Comes From Context, Not Title
Sarah’s first instinct was to defer to the senior engineering manager on the call. He had managed teams for years. Surely he should lead?
But he was in infrastructure. He didn’t know the payment flow. He was trying to debug Kubernetes networking while the actual problem was in transaction processing logic.
The lesson: Technical leadership in crisis comes from who has the best context, not who has the highest title. Sarah knew the system. That made her the right leader.
She made the call: “Mike, I need you to focus on keeping the infrastructure stable. Don’t try to fix the payment issue—that’s creating noise. Can you confirm our pods are healthy and traffic is routing correctly?”
Clear roles. Clear responsibility. The chaos started organizing.
2. Pattern Recognition Beats Deep Analysis
Twenty minutes into the incident, Sarah noticed something in the metrics. Transaction failures were correlated with a specific payment provider. Not causally linked—just correlated at 0.73.
A senior engineer suggested diving deep into that provider’s API logs. It would take 30 minutes to pull and analyze.
Sarah had a different intuition. “That provider handles 40% of our traffic. A 0.73 correlation on 40% of traffic doesn’t explain a 40% failure rate across all of our traffic.”
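The arithmetic behind that intuition is worth spelling out. A rough sketch using the numbers from the story (not data from the real dashboards):

```python
# Back-of-envelope check behind Sarah's reasoning; the 40% traffic share,
# 40% failure rate, and 0.73 correlation are the figures from the incident.
provider_share = 0.40        # fraction of traffic routed to the suspect provider
overall_failure_rate = 0.40  # observed failure rate across all transactions

# If failures were confined to that provider, its own failure rate would have to be:
implied_provider_failure = overall_failure_rate / provider_share
print(implied_provider_failure)  # 1.0 -- every one of its transactions failing
```

For the suspect provider alone to account for the outage, every transaction it touched would have to be failing, and the imperfect 0.73 correlation meant failures were also hitting transactions routed elsewhere.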
She’d seen this pattern before, three years earlier, during a much smaller incident. The real problem was in the fallback logic. When one provider started timing out, the fallback wasn’t handling partial failures correctly.
The lesson: Pattern recognition from years of experience is more valuable during incidents than deep first-principles analysis. Staff engineers have seen enough systems fail to recognize failure modes.
She directed: “Check the timeout handling in payment-gateway-service, specifically the provider fallback chain. I think we’re double-counting timeouts.”
They found it in four minutes. A recent deployment had changed timeout accounting. When the primary provider slowed down (not failing, just slow), the fallback logic triggered. But it also marked the transaction as failed even though the fallback succeeded.
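A minimal sketch of that failure mode, using hypothetical names (charge_with_fallback, ProviderTimeout, PaymentResult) rather than the real payment-gateway-service code:

```python
# Illustrative sketch of the fallback bug; names are hypothetical, not the
# actual payment-gateway-service implementation.
from dataclasses import dataclass


@dataclass
class PaymentResult:
    succeeded: bool
    provider: str


class ProviderTimeout(Exception):
    """Raised when a provider is too slow to respond."""


def charge_with_fallback(primary, fallback, payment):
    try:
        return primary.charge(payment)  # a slow primary raises ProviderTimeout
    except ProviderTimeout:
        result = fallback.charge(payment)
        # BUG (per the incident): the timeout was recorded as a failed
        # transaction even when the fallback succeeded:
        #   return PaymentResult(succeeded=False, provider=result.provider)
        # FIX: report the fallback's actual outcome instead.
        return result
```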
Fix deployed at 3:31 AM. Transaction success rate back to 99.2% by 3:40 AM.
The Real Work Began After
Sarah thought she was done. The system was working. Everyone could go back to sleep.
But the VP of Engineering messaged her: “Good work. Can you run the post-mortem?”
This is where Sarah learned what Staff Engineer leadership really meant.
The Post-Mortem That Changed The Team
Sarah scheduled the post-mortem for Monday. She could have written a standard document: root cause, timeline, action items. Check the box. Move on.
Instead, she asked herself: “Why did twelve smart engineers spend 45 minutes failing to solve this until I joined?”
The answer was uncomfortable: The team lacked shared context. Everyone knew their piece of the system. Nobody understood the whole payment flow. When things broke across boundaries, there was no shared mental model to coordinate around.
The lesson: Technical incidents reveal organizational gaps. Staff engineers fix both the technical problem and the organizational problem.
Sarah’s post-mortem included:
- Standard root cause analysis
- A new payment system architecture diagram showing every service and their failure modes
- A proposed incident response playbook specific to payment failures
- A recommendation for quarterly payment system “game days” where the team practices coordinated debugging
Turning Crisis Into Capability
Two weeks after the incident, Sarah ran the first payment system game day. She artificially introduced failures and made engineers practice the incident response process.
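The story doesn’t say what tooling the game day used, but a fault-injection wrapper along these lines is one simple way to stage that kind of drill (a hypothetical helper, not Sarah’s actual setup):

```python
# Hypothetical game-day helper: wraps a provider call so a drill can inject
# latency or timeouts. Illustrative only; not the tooling from the story.
import random
import time


def inject_faults(call, slow_rate=0.2, timeout_rate=0.1, delay_s=5.0):
    """Return a wrapped call that sometimes stalls or raises a timeout."""
    def wrapped(*args, **kwargs):
        roll = random.random()
        if roll < timeout_rate:
            raise TimeoutError("game-day injected timeout")
        if roll < timeout_rate + slow_rate:
            time.sleep(delay_s)  # slow but not failing: the Black Friday mode
        return call(*args, **kwargs)
    return wrapped
```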
It felt awkward. People complained it was “fake” and “contrived.”
But three months later, when another payment incident occurred, the team resolved it in eleven minutes. Without Sarah.
The lesson: Staff engineers build capability, not dependency. The goal isn’t to be the hero. It’s to make heroics unnecessary.
What Changed
Six months after the Black Friday incident, Sarah reflected on what shifted:
Before: Influence Through Expertise
Sarah’s influence came from being the smartest person in the room. Engineers asked her questions. She provided answers. She was a knowledge repository.
After: Influence Through Systems
Sarah’s influence came from building systems that made the team more capable. The architecture diagrams. The playbooks. The practice sessions. She created structures that made everyone smarter.
One engineer told her: “I used to think Staff Engineers were just really good senior engineers. Now I realize you’re building the scaffolding that lets the rest of us do better work.”
The Uncomfortable Truth
Sarah also learned something she didn’t like: Being incident commander was terrifying because it was high-stakes and highly visible. But that visibility was precisely what gave her leverage to drive organizational change.
The uncomfortable truth: Sometimes you need a crisis to get permission to fix underlying problems.
The architecture diagrams Sarah proposed? She’d tried to create them nine months earlier. Nobody prioritized it. Too busy shipping features.
After the Black Friday incident? Full team buy-in. Two engineers dedicated to documentation for a month.
The lesson for Staff Engineers: Crisis creates urgency, but you need to be ready with solutions. Sarah had already been thinking about these problems. When the moment arrived, she had answers ready.
Key Takeaways
1. Leadership Is Contextual
You don’t need permission to lead if you have the most relevant context. Step up when your expertise is what the moment needs.
2. Pattern Recognition Is A Superpower
Years of experience seeing systems fail builds an intuition that’s faster and more accurate than analysis under pressure.
3. Fix The System, Not Just The Symptom
Technical incidents are windows into organizational gaps. Use them to drive structural improvements.
4. Build Capability, Not Dependency
The mark of great technical leadership is the team performing better when you’re not there.
5. Prepare Before The Crisis
Have solutions ready for the problems you see coming. When crisis creates urgency, you need to move fast.
The Question Sarah Now Asks
After every incident, after every major decision, Sarah asks: “What system can I build so this problem gets easier for everyone?”
Not: “How do I solve this?”
But: “How do I make this solvable by the team?”
That’s the shift from senior engineer to Staff Engineer. From individual excellence to collective capability.
Sarah became incident commander once. She never needed to do it again. That was the point.