The Availability Target That Defined a Platform
The Setup
Sarah had been a Staff Engineer at a financial technology company for six months when the VP of Engineering asked a seemingly simple question during a platform review: “What’s our availability target?”
The team had built a payment processing platform serving 200+ microservices. They’d focused on features, throughput, and developer experience. But availability? They had monitoring. They had incident response. They had an on-call rotation. What they didn’t have was an explicit target.
“We aim for high availability,” someone offered.
The VP nodded. “Define ‘high.’”
The Investigation
Sarah volunteered to research availability targets. She expected a two-week project. It became a three-month journey that reshaped the platform.
Starting with the Numbers
She began with industry standards:
- 99.9% (three nines): 43 minutes downtime/month
- 99.95%: 21 minutes downtime/month
- 99.99% (four nines): 4.3 minutes downtime/month
- 99.999% (five nines): 26 seconds downtime/month
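These figures are straight arithmetic: the allowed downtime is the unavailable fraction multiplied by the minutes in a month. A minimal sketch, assuming a 30-day month (which is why the numbers above are rounded):

```python
# Allowed downtime per month for a given availability target.
# Assumes a 30-day (43,200-minute) month; calendar months vary slightly.
MINUTES_PER_MONTH = 30 * 24 * 60

def allowed_downtime_minutes(availability: float) -> float:
    """Minutes of downtime permitted per month at the given availability."""
    return (1 - availability) * MINUTES_PER_MONTH

for target in (0.999, 0.9995, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target * 100:g}%: {minutes:.1f} min/month ({minutes * 60:.0f} s)")
```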
“We should go for five nines,” a senior engineer suggested. “We’re handling money.”
Sarah wasn’t convinced. Five nines sounded great in theory. But what did it actually mean?
The Real Cost of Nines
She started mapping what each additional nine would require:
For 99.9%:
- Basic monitoring and alerting
- Manual failover capabilities
- Recovery time objective (RTO) of ~30 minutes
- Estimated incremental cost: none (this is the current state)
For 99.95%:
- Automated health checks
- Multi-AZ deployment
- Database replicas
- Estimated incremental cost: $40K/year
For 99.99%:
- Active-active multi-region
- Automated failover
- Distributed databases with strong consistency
- Estimated incremental cost: $200K/year
For 99.999%:
- Global traffic management
- Real-time data replication
- Zero-downtime deployment infrastructure
- 24/7 on-call team
- Estimated incremental cost: $800K+/year
But the real revelation came when she analyzed their actual incidents.
The Incident Pattern
Looking at six months of data:
- Total incidents: 23
- Average time to detect: 12 minutes
- Average time to resolve: 34 minutes
- Incidents caused by deployments: 15 (65%)
- Incidents caused by infrastructure: 5 (22%)
- Incidents caused by dependencies: 3 (13%)
Most incidents shared a pattern: They affected specific customer segments or API endpoints, not the entire platform.
A critical insight emerged: Their current “downtime” wasn’t platform-wide. It was partial degradation.
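Measuring partial degradation calls for success-rate SLIs that are request-weighted rather than clock-based, so a failure on one endpoint only costs the requests it actually affected. A minimal sketch of the idea; the endpoint names and counts are illustrative, not the team’s real traffic:

```python
from dataclasses import dataclass

@dataclass
class RequestWindow:
    """Request counts for one endpoint over a measurement window."""
    endpoint: str
    total: int
    failed: int

def success_rate_sli(windows: list[RequestWindow]) -> float:
    """Success-rate SLI: good requests / total requests across the window."""
    total = sum(w.total for w in windows)
    failed = sum(w.failed for w in windows)
    return 1.0 if total == 0 else (total - failed) / total

# A partial outage on one endpoint barely moves the platform-wide SLI,
# even though an uptime check would have counted the whole period as "down".
windows = [
    RequestWindow("POST /payments/authorize", total=1_000_000, failed=1_200),
    RequestWindow("GET /dashboard/summary", total=200_000, failed=15_000),
]
print(f"Platform SLI: {success_rate_sli(windows):.4%}")
```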
The Redefinition
Sarah wrote a proposal that challenged the team’s thinking:
Instead of a Single Availability Target, Define Service Level Objectives (SLOs) by Impact
Critical Path (Payment Authorization):
- Target: 99.95% success rate
- Latency: p99 < 500ms
- Rationale: Directly affects revenue, but most failures are retryable
Payment Settlement:
- Target: 99.99% success rate
- Latency: p99 < 5 seconds
- Rationale: Not user-facing, but financial accuracy critical
Dashboard APIs:
- Target: 99.5% success rate
- Latency: p99 < 2 seconds
- Rationale: Important for UX, but degradation is acceptable
Batch Reporting:
- Target: 99% success rate
- Latency: Best effort
- Rationale: Non-critical, can be retried
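One way to make targets like these machine-readable is to keep them as data that dashboards and deploy tooling both consume. The structure and service names below are an illustrative sketch, not the platform’s actual configuration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SLO:
    service: str
    target_success_rate: float      # fraction of requests that must succeed
    p99_latency_ms: Optional[int]   # None = best effort

SLOS = [
    SLO("payment-authorization", 0.9995, 500),
    SLO("payment-settlement",    0.9999, 5_000),
    SLO("dashboard-api",         0.995,  2_000),
    SLO("batch-reporting",       0.99,   None),
]
```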
The Controversial Part
The proposal included explicit error budgets:
- Critical path gets 22 minutes/month of errors
- Team can “spend” this budget on:
  - Faster deployments
  - Riskier experiments
  - Dependency upgrades
If the budget is exhausted, freeze all non-critical changes until next month.
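The 22-minute figure is simply the 99.95% target applied to a ~43,200-minute month (0.05% of it). A minimal sketch of the budget accounting, assuming the SLI is rolled up into “bad minutes” (minutes where the success rate fell below target):

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def error_budget_minutes(target: float) -> float:
    """Total monthly error budget, expressed in bad minutes."""
    return (1 - target) * MINUTES_PER_MONTH

def budget_remaining_fraction(target: float, bad_minutes: float) -> float:
    """Share of the month's budget still unspent; negative means exhausted."""
    budget = error_budget_minutes(target)
    return (budget - bad_minutes) / budget

print(error_budget_minutes(0.9995))              # ~21.6 min ("22 minutes/month")
print(budget_remaining_fraction(0.9995, 14.0))   # ~0.35 of the budget left
```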
The Pushback
“This is too complex”
Product managers worried about tracking multiple SLOs. Sarah created a dashboard showing each service’s error budget in simple traffic light colors. Green = healthy budget, yellow = watch carefully, red = change freeze.
“We can’t tell customers we’re only 99.5% available”
Sarah reframed it: “We’re promising specific capabilities at specific reliability levels. Customers care about payment success rate, not whether the analytics dashboard loads instantly.”
“Error budgets will make engineers lazy”
This was the hardest objection. Sarah addressed it by showing data: Teams with error budgets deployed more frequently while maintaining better availability. The budget created psychological safety to innovate.
The Implementation
Sarah didn’t just write a document. She built the infrastructure:
The SLO Monitoring System
- Automated SLI collection from existing metrics
- Real-time error budget tracking visible in team dashboards
- Alerts when budget burns too quickly (e.g., 50% spent in one week)
- Monthly budget reset with automatic reports
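A fast-burn alert like the one described could compare the spent share of the budget with how far into the month the team is. The threshold and the way elapsed time is measured here are assumptions for illustration:

```python
def budget_burn_alert(spent_fraction: float, elapsed_fraction: float,
                      threshold: float = 2.0) -> bool:
    """Alert if the budget is burning `threshold` times faster than an even
    spend over the month would allow. 50% spent in the first week of a
    ~4.3-week month is roughly 2x the sustainable rate."""
    if elapsed_fraction <= 0:
        return False
    burn_rate = spent_fraction / elapsed_fraction
    return burn_rate >= threshold

# 50% of the budget gone one week (~23%) into the month -> burn rate ~2.1 -> alert
print(budget_burn_alert(spent_fraction=0.5, elapsed_fraction=7 / 30))
```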
The Change Approval Process
Before deployment:
1. Check error budget status
2. If red, require VP approval for non-critical changes
3. If yellow, require extra scrutiny and rollback plan
4. If green, proceed with standard review
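As code, the pre-deployment check might look like the sketch below, mapping remaining budget to the dashboard’s traffic-light states. The thresholds and function names are assumptions rather than the team’s actual tooling:

```python
from enum import Enum

class BudgetStatus(Enum):
    GREEN = "green"    # healthy budget: standard review
    YELLOW = "yellow"  # watch carefully: extra scrutiny + rollback plan
    RED = "red"        # budget exhausted: freeze non-critical changes

def budget_status(remaining_fraction: float) -> BudgetStatus:
    """Map the unspent share of the error budget to a traffic-light state."""
    if remaining_fraction <= 0:
        return BudgetStatus.RED
    if remaining_fraction < 0.25:
        return BudgetStatus.YELLOW
    return BudgetStatus.GREEN

def may_deploy(status: BudgetStatus, critical_change: bool,
               vp_approved: bool = False) -> bool:
    """Gate non-critical changes when the budget is red."""
    if status is BudgetStatus.RED and not critical_change:
        return vp_approved
    return True
```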
The Cultural Shift
She ran workshops teaching teams:
- How to calculate availability from error budgets
- How to trade off velocity vs reliability
- How to communicate SLO breaches to stakeholders
- How to use error budgets to justify reliability investments
The Results (Six Months Later)
Quantitative Impact
- Deployment frequency: Up 40% (from 3/week to 4.2/week)
- Critical path availability: Improved from 99.91% to 99.96%
- Mean time to recovery: Down ~29% (from 34 to 24 minutes)
- Infrastructure costs: Reduced by $60K/year by avoiding over-engineering of non-critical paths
Qualitative Impact
- Product conversations changed: From “is it up?” to “what success rate do we need?”
- Engineers felt empowered: Error budgets justified technical debt work
- Incidents became learning opportunities: Focus shifted from blame to budget management
The Unexpected Win
One team had a struggling service with frequent incidents. Traditional thinking would demand higher reliability investment. But SLO analysis revealed the service had a 98% target and was performing at 98.5%. The real issue was miscommunicated expectations.
By clarifying the service’s actual criticality, the team avoided $100K in unnecessary infrastructure work.
The Lessons
1. Availability is Not Binary
“High availability” is meaningless without context. Different components need different reliability based on business impact.
2. Constraints Enable Velocity
Error budgets gave teams permission to fail in controlled ways. This increased innovation and deployment frequency.
3. Visibility Drives Behavior
Making error budgets visible and actionable changed team decision-making more than any policy could.
4. Data Beats Intuition
Engineers wanted five nines because it sounded right. Data showed where reliability actually mattered.
5. Staff Engineers Define “Done”
Sarah’s impact wasn’t technical architecture. It was defining what “reliable enough” meant and building systems to measure it.
The Career Impact
Six months after the SLO project, Sarah was promoted to Principal Engineer. The promotion feedback highlighted her:
- System-level thinking: Seeing beyond technical implementation to business impact
- Organizational influence: Changing how teams made decisions across the company
- Pragmatic engineering: Balancing idealism with practical constraints
- Multiplicative impact: Creating frameworks others could use
She didn’t write much code during the SLO project. She wrote monitoring queries, dashboards, and documentation. Most importantly, she created a shared language for reliability conversations.
Key Takeaways for Staff Engineers
- Question implicit assumptions: “High availability” meant different things to everyone
- Make tradeoffs explicit: Error budgets turned invisible decisions into visible resource management
- Build frameworks, not one-time solutions: SLOs became reusable across teams
- Data-driven persuasion: Incident analysis overcame emotional arguments
- Infrastructure for decision-making: Good tooling changes how people think
The availability target didn’t just define the platform. It defined Sarah’s approach to staff engineering: Find the unasked questions, make the invisible visible, and build systems that help others make better decisions.