The Availability Target That Defined a Platform
The Setup
Sarah had been a Staff Engineer at a financial technology company for six months when the VP of Engineering asked a seemingly simple question during a platform review: “What’s our availability target?”
The team had built a payment processing platform serving 200+ microservices. They’d focused on features, throughput, and developer experience. But availability? They had monitoring. They had incident response. They had an on-call rotation. What they didn’t have was an explicit target.
“We aim for high availability,” someone offered.
The VP nodded. “Define ‘high.’”
The Investigation
Sarah volunteered to research availability targets. She expected a two-week project. It became a three-month journey that reshaped the platform.
Starting with the Numbers
She began with industry standards:
- 99.9% (three nines): 43 minutes downtime/month
- 99.95%: 21 minutes downtime/month
- 99.99% (four nines): 4.3 minutes downtime/month
- 99.999% (five nines): 26 seconds downtime/month
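These figures are straight arithmetic: the allowed downtime is the unavailable fraction multiplied by the minutes in a month. A minimal sketch, assuming a 30-day month (which is why the numbers above are rounded):

```python
# Allowed downtime per month for a given availability target.
# Assumes a 30-day (43,200-minute) month; calendar months vary slightly.
MINUTES_PER_MONTH = 30 * 24 * 60

def allowed_downtime_minutes(availability: float) -> float:
    """Minutes of downtime permitted per month at the given availability."""
    return (1 - availability) * MINUTES_PER_MONTH

for target in (0.999, 0.9995, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target * 100:g}%: {minutes:.1f} min/month ({minutes * 60:.0f} s)")
```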
“We should go for five nines,” a senior engineer suggested. “We’re handling money.”
Sarah wasn’t convinced. Five nines sounded great in theory. But what did it actually mean?
The Real Cost of Nines
She started mapping what each additional nine would require:
For 99.9%:
- Basic monitoring and alerting
- Manual failover capabilities
- Recovery time objective (RTO) of ~30 minutes
- Estimated incremental cost: none (this is the current state)
For 99.95%:
- Automated health checks
- Multi-AZ deployment
- Database replicas
- Estimated incremental cost: $40K/year
For 99.99%:
- Active-active multi-region
- Automated failover
- Distributed databases with strong consistency
- Estimated incremental cost: $200K/year
For 99.999%:
- Global traffic management
- Real-time data replication
- Zero-downtime deployment infrastructure
- 24/7 on-call team
- Estimated incremental cost: $800K+/year
But the real revelation came when she analyzed their actual incidents.
The Incident Pattern
Looking at six months of data:
- Total incidents: 23
- Average time to detect: 12 minutes
- Average time to resolve: 34 minutes
- Incidents caused by deployments: 15 (65%)
- Incidents caused by infrastructure: 5 (22%)
- Incidents caused by dependencies: 3 (13%)
Most incidents shared a pattern: They affected specific customer segments or API endpoints, not the entire platform.
A critical insight emerged: Their current “downtime” wasn’t platform-wide. It was partial degradation.
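Measuring partial degradation calls for success-rate SLIs that are request-weighted rather than clock-based, so a failure on one endpoint only costs the requests it actually affected. A minimal sketch of the idea; the endpoint names and counts are illustrative, not the team’s real traffic:

```python
from dataclasses import dataclass

@dataclass
class RequestWindow:
    """Request counts for one endpoint over a measurement window."""
    endpoint: str
    total: int
    failed: int

def success_rate_sli(windows: list[RequestWindow]) -> float:
    """Success-rate SLI: good requests / total requests across the window."""
    total = sum(w.total for w in windows)
    failed = sum(w.failed for w in windows)
    return 1.0 if total == 0 else (total - failed) / total

# A partial outage on one endpoint barely moves the platform-wide SLI,
# even though an uptime check would have counted the whole period as "down".
windows = [
    RequestWindow("POST /payments/authorize", total=1_000_000, failed=1_200),
    RequestWindow("GET /dashboard/summary", total=200_000, failed=15_000),
]
print(f"Platform SLI: {success_rate_sli(windows):.4%}")
```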
The Redefinition
Sarah wrote a proposal that challenged the team’s thinking:
Instead of a Single Availability Target, Define Service Level Objectives (SLOs) by Impact
Critical Path (Payment Authorization):
- Target: 99.95% success rate
- Latency: p99 < 500ms
- Rationale: Directly affects revenue, but most failures are retryable
Payment Settlement:
- Target: 99.99% success rate
- Latency: p99 < 5 seconds
- Rationale: Not user-facing, but financial accuracy critical
Dashboard APIs:
- Target: 99.5% success rate
- Latency: p99 < 2 seconds
- Rationale: Important for UX, but degradation is acceptable
Batch Reporting:
- Target: 99% success rate
- Latency: Best effort
- Rationale: Non-critical, can be retried
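One way to make targets like these machine-readable is to keep them as data that dashboards and deploy tooling both consume. The structure and service names below are an illustrative sketch, not the platform’s actual configuration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SLO:
    service: str
    target_success_rate: float      # fraction of requests that must succeed
    p99_latency_ms: Optional[int]   # None = best effort

SLOS = [
    SLO("payment-authorization", 0.9995, 500),
    SLO("payment-settlement",    0.9999, 5_000),
    SLO("dashboard-api",         0.995,  2_000),
    SLO("batch-reporting",       0.99,   None),
]
```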
The Controversial Part
The proposal included explicit error budgets:
- Critical path gets 22 minutes/month of errors
- Team can “spend” this budget on:
  - Faster deployments
  - Riskier experiments
  - Dependency upgrades
If the budget is exhausted, freeze all non-critical changes until next month.
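The 22-minute figure is simply the 99.95% target applied to a ~43,200-minute month (0.05% of it). A minimal sketch of the budget accounting, assuming the SLI is rolled up into “bad minutes” (minutes where the success rate fell below target):

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def error_budget_minutes(target: float) -> float:
    """Total monthly error budget, expressed in bad minutes."""
    return (1 - target) * MINUTES_PER_MONTH

def budget_remaining_fraction(target: float, bad_minutes: float) -> float:
    """Share of the month's budget still unspent; negative means exhausted."""
    budget = error_budget_minutes(target)
    return (budget - bad_minutes) / budget

print(error_budget_minutes(0.9995))              # ~21.6 min ("22 minutes/month")
print(budget_remaining_fraction(0.9995, 14.0))   # ~0.35 of the budget left
```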
The Pushback
“This is too complex”
Product managers worried about tracking multiple SLOs. Sarah created a dashboard showing each service’s error budget in simple traffic light colors. Green = healthy budget, yellow = watch carefully, red = change freeze.
“We can’t tell customers we’re only 99.5% available”
Sarah reframed it: “We’re promising specific capabilities at specific reliability levels. Customers care about payment success rate, not whether the analytics dashboard loads instantly.”
“Error budgets will make engineers lazy”
This was the hardest objection. Sarah addressed it by showing data: Teams with error budgets deployed more frequently while maintaining better availability. The budget created psychological safety to innovate.
The Implementation
Sarah didn’t just write a document. She built the infrastructure:
The SLO Monitoring System
- Automated SLI collection from existing metrics
- Real-time error budget tracking visible in team dashboards
- Alerts when budget burns too quickly (e.g., 50% spent in one week)
- Monthly budget reset with automatic reports
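A fast-burn alert like the one described could compare the spent share of the budget with how far into the month the team is. The threshold and the way elapsed time is measured here are assumptions for illustration:

```python
def budget_burn_alert(spent_fraction: float, elapsed_fraction: float,
                      threshold: float = 2.0) -> bool:
    """Alert if the budget is burning `threshold` times faster than an even
    spend over the month would allow. 50% spent in the first week of a
    ~4.3-week month is roughly 2x the sustainable rate."""
    if elapsed_fraction <= 0:
        return False
    burn_rate = spent_fraction / elapsed_fraction
    return burn_rate >= threshold

# 50% of the budget gone one week (~23%) into the month -> burn rate ~2.1 -> alert
print(budget_burn_alert(spent_fraction=0.5, elapsed_fraction=7 / 30))
```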
The Change Approval Process
Before deployment:
1. Check error budget status
2. If red, require VP approval for non-critical changes
3. If yellow, require extra scrutiny and rollback plan
4. If green, proceed with standard review
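As code, the pre-deployment check might look like the sketch below, mapping remaining budget to the dashboard’s traffic-light states. The thresholds and function names are assumptions rather than the team’s actual tooling:

```python
from enum import Enum

class BudgetStatus(Enum):
    GREEN = "green"    # healthy budget: standard review
    YELLOW = "yellow"  # watch carefully: extra scrutiny + rollback plan
    RED = "red"        # budget exhausted: freeze non-critical changes

def budget_status(remaining_fraction: float) -> BudgetStatus:
    """Map the unspent share of the error budget to a traffic-light state."""
    if remaining_fraction <= 0:
        return BudgetStatus.RED
    if remaining_fraction < 0.25:
        return BudgetStatus.YELLOW
    return BudgetStatus.GREEN

def may_deploy(status: BudgetStatus, critical_change: bool,
               vp_approved: bool = False) -> bool:
    """Gate non-critical changes when the budget is red."""
    if status is BudgetStatus.RED and not critical_change:
        return vp_approved
    return True
```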
The Cultural Shift
She ran workshops teaching teams:
- How to calculate availability from error budgets
- How to trade off velocity vs reliability
- How to communicate SLO breaches to stakeholders
- How to use error budgets to justify reliability investments
The Results (Six Months Later)
Quantitative Impact
- Deployment frequency: Up 40% (from 3/week to 4.2/week)
- Critical path availability: Improved from 99.91% to 99.96%
- Mean time to recovery: Down ~29% (from 34 to 24 minutes)
- Infrastructure costs: Reduced by $60K/year by avoiding over-engineering of non-critical paths
Qualitative Impact
- Product conversations changed: From “is it up?” to “what success rate do we need?”
- Engineers felt empowered: Error budgets justified technical debt work
- Incidents became learning opportunities: Focus shifted from blame to budget management
The Unexpected Win
One team had a struggling service with frequent incidents. Traditional thinking would demand higher reliability investment. But SLO analysis revealed the service had a 98% target and was performing at 98.5%. The real issue was miscommunicated expectations.
By clarifying the service’s actual criticality, the team avoided $100K in unnecessary infrastructure work.
The Lessons
1. Availability is Not Binary
“High availability” is meaningless without context. Different components need different reliability based on business impact.
2. Constraints Enable Velocity
Error budgets gave teams permission to fail in controlled ways. This increased innovation and deployment frequency.
3. Visibility Drives Behavior
Making error budgets visible and actionable changed team decision-making more than any policy could.
4. Data Beats Intuition
Engineers wanted five nines because it sounded right. Data showed where reliability actually mattered.
5. Staff Engineers Define “Done”
Sarah’s impact wasn’t technical architecture. It was defining what “reliable enough” meant and building systems to measure it.
The Career Impact
Six months after the SLO project, Sarah was promoted to Principal Engineer. The promotion feedback highlighted her:
- System-level thinking: Seeing beyond technical implementation to business impact
- Organizational influence: Changing how teams made decisions across the company
- Pragmatic engineering: Balancing idealism with practical constraints
- Multiplicative impact: Creating frameworks others could use
She didn’t write much code during the SLO project. She wrote monitoring queries, dashboards, and documentation. Most importantly, she created a shared language for reliability conversations.
Key Takeaways for Staff Engineers
- Question implicit assumptions: “High availability” meant different things to everyone
- Make tradeoffs explicit: Error budgets turned invisible decisions into visible resource management
- Build frameworks, not one-time solutions: SLOs became reusable across teams
- Data-driven persuasion: Incident analysis overcame emotional arguments
- Infrastructure for decision-making: Good tooling changes how people think
The availability target didn’t just define the platform. It defined Sarah’s approach to staff engineering: Find the unasked questions, make the invisible visible, and build systems that help others make better decisions.