The Availability Target That Defined a Platform

The Setup

Sarah had been a Staff Engineer at a financial technology company for six months when the VP of Engineering asked a seemingly simple question during a platform review: “What’s our availability target?”

The team had built a payment processing platform serving 200+ microservices. They’d focused on features, throughput, and developer experience. But availability? They had monitoring. They had incident response. They had an on-call rotation. What they didn’t have was an explicit target.

“We aim for high availability,” someone offered.

The VP nodded. “Define ‘high.’”

The Investigation

Sarah volunteered to research availability targets. She expected a two-week project. It became a three-month journey that reshaped the platform.

Starting with the Numbers

She began with industry standards:

“We should go for five nines,” a senior engineer suggested. “We’re handling money.”

Sarah wasn’t convinced. Five nines sounded great in theory. But what did it actually mean?

The Real Cost of Nines

She started mapping what each additional nine would require:

For 99.9%:

For 99.95%:

For 99.99%:

For 99.999%:
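
The raw arithmetic behind those targets is worth keeping in view, because each extra nine shrinks the allowance dramatically. A quick sketch of the calendar math (no assumptions about the platform itself):

```python
# Allowed downtime per year implied by each availability target.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960

for target in (0.999, 0.9995, 0.9999, 0.99999):
    downtime_min = (1 - target) * MINUTES_PER_YEAR
    print(f"{target:.3%}: {downtime_min:.1f} min/year (~{downtime_min / 60:.1f} h)")

# 99.900%: 526.0 min/year (~8.8 h)
# 99.950%: 263.0 min/year (~4.4 h)
# 99.990%: 52.6 min/year (~0.9 h)
# 99.999%: 5.3 min/year (~0.1 h)
```

Five nines leaves roughly five minutes of total downtime per year, often less than the time it takes just to page the right person.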

But the real revelation came when she analyzed their actual incidents.

The Incident Pattern

Looking at six months of data:

Most incidents shared a pattern: They affected specific customer segments or API endpoints, not the entire platform.

A critical insight emerged: Their current “downtime” wasn’t platform-wide. It was partial degradation.
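
That insight changes the measurement model: if “down” usually means a slice of traffic failing rather than the whole platform being offline, availability is better expressed as good requests over total requests than as wall-clock uptime. A minimal sketch of the request-based view (the traffic and failure figures below are invented for illustration, not drawn from the incident data):

```python
def request_availability(total_requests: int, failed_requests: int) -> float:
    """Request-based availability: a partial degradation only costs
    availability in proportion to the traffic it actually affects."""
    return 1 - failed_requests / total_requests

# Hypothetical month: 500M requests, with one incident that failed
# 1.2M requests against a single endpoint.
print(f"{request_availability(500_000_000, 1_200_000):.4%}")  # 99.7600%
```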

The Redefinition

Sarah wrote a proposal that challenged the team’s thinking:

Instead of a Single Availability Target, Define Service Level Objectives (SLOs) by Impact

Critical Path (Payment Authorization):

Payment Settlement:

Dashboard APIs:

Batch Reporting:
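
Writing the proposal as data rather than prose is what later made it enforceable by tooling. A sketch of the shape it might take; the targets below are illustrative placeholders, not the numbers from Sarah’s proposal:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SLO:
    service: str
    sli: str          # what is measured
    target: float     # fraction of good events over the window
    window_days: int  # rolling measurement window

# Placeholder targets only -- the real ones come from business-impact analysis.
SLOS = [
    SLO("payment-authorization", "successful authorization requests", 0.9995, 30),
    SLO("payment-settlement", "settlements completed on schedule", 0.999, 30),
    SLO("dashboard-api", "successful dashboard requests", 0.995, 30),
    SLO("batch-reporting", "reports delivered by deadline", 0.99, 30),
]
```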

The Controversial Part

The proposal included explicit error budgets:

If the budget is exhausted, freeze all non-critical changes until next month.
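
An error budget is simply the complement of the target over the measurement window, which turns the freeze rule into arithmetic instead of a judgment call. A minimal sketch, reusing the illustrative numbers above:

```python
def error_budget_remaining(target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the window's error budget still unspent (negative means overspent)."""
    allowed_bad = (1 - target) * total_events
    if allowed_bad == 0:
        return 0.0
    return 1 - bad_events / allowed_bad

# Illustrative month for payment authorization: 99.95% target, 200M requests,
# 60k failed requests so far.
remaining = error_budget_remaining(0.9995, 200_000_000, 60_000)
print(f"{remaining:.0%} of the budget left")  # 40% of the budget left
freeze_non_critical = remaining <= 0
```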

The Pushback

“This is too complex”

Product managers worried about tracking multiple SLOs. Sarah created a dashboard showing each service’s error budget in simple traffic light colors. Green = healthy budget, yellow = watch carefully, red = change freeze.
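
The traffic-light mapping is essentially the whole dashboard: one threshold decision per service, rendered in a color anyone can act on. A sketch of the classification, with thresholds that are illustrative rather than taken from Sarah’s dashboard:

```python
def budget_status(remaining: float) -> str:
    """Map remaining error-budget fraction to a traffic-light status."""
    if remaining <= 0:
        return "red"     # budget exhausted: change freeze
    if remaining < 0.25:
        return "yellow"  # watch carefully, extra scrutiny on changes
    return "green"       # healthy budget, business as usual

assert budget_status(0.40) == "green"
assert budget_status(0.10) == "yellow"
assert budget_status(-0.05) == "red"
```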

“We can’t tell customers we’re only 99.5% available”

Sarah reframed it: “We’re promising specific capabilities at specific reliability levels. Customers care about payment success rate, not whether the analytics dashboard loads instantly.”

“Error budgets will make engineers lazy”

This was the hardest objection. Sarah addressed it by showing data: Teams with error budgets deployed more frequently while maintaining better availability. The budget created psychological safety to innovate.

The Implementation

Sarah didn’t just write a document. She built the infrastructure:

The SLO Monitoring System
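
At its core, a system like this has one continuous job: fold raw success/failure counts into a rolling-window SLI that can be compared against the target. A purely illustrative sketch of that loop, using an in-memory window where a real system would query a metrics store:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Sample:
    good: int   # successful events in this interval
    total: int  # all events in this interval

class RollingSLI:
    """Rolling-window SLI over fixed-size intervals (e.g. one sample per minute)."""

    def __init__(self, window_samples: int):
        self.samples = deque(maxlen=window_samples)

    def record(self, good: int, total: int) -> None:
        self.samples.append(Sample(good, total))

    def value(self) -> float:
        total = sum(s.total for s in self.samples)
        good = sum(s.good for s in self.samples)
        return good / total if total else 1.0

# A 30-day window at one sample per minute is 43,200 samples.
sli = RollingSLI(window_samples=43_200)
sli.record(good=99_950, total=100_000)
print(f"{sli.value():.4%}")  # 99.9500%
```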

The Change Approval Process

Before deployment:
1. Check error budget status
2. If red, require VP approval for non-critical changes
3. If yellow, require extra scrutiny and rollback plan
4. If green, proceed with standard review
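
Those four steps collapse into a small gate that a CI pipeline can call before every rollout. A sketch, reusing the illustrative traffic-light statuses above (the function and message wording are illustrative, not the team’s actual tooling):

```python
def deployment_gate(status: str, change_is_critical: bool) -> str:
    """Translate error-budget status into the level of approval a change needs."""
    if status == "red" and not change_is_critical:
        return "blocked: requires VP approval (budget exhausted, non-critical change)"
    if status == "yellow":
        return "proceed with extra scrutiny and a documented rollback plan"
    return "proceed with standard review"

print(deployment_gate("red", change_is_critical=False))
print(deployment_gate("green", change_is_critical=True))
```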

The Cultural Shift

She ran workshops teaching teams:

The Results (Six Months Later)

Quantitative Impact

Qualitative Impact

The Unexpected Win

One team had a struggling service with frequent incidents. Traditional thinking would have demanded more investment in reliability. But the SLO analysis revealed the service had a 98% target and was already performing at 98.5%. The real issue was miscommunicated expectations.

By clarifying the service’s actual criticality, the team avoided $100K in unnecessary infrastructure work.

The Lessons

1. Availability is Not Binary

“High availability” is meaningless without context. Different components need different reliability targets based on their business impact.

2. Constraints Enable Velocity

Error budgets gave teams permission to fail in controlled ways. This increased innovation and deployment frequency.

3. Visibility Drives Behavior

Making error budgets visible and actionable changed team decision-making more than any policy could.

4. Data Beats Intuition

Engineers wanted five nines because it sounded right. Data showed where reliability actually mattered.

5. Staff Engineers Define “Done”

Sarah’s impact wasn’t technical architecture. It was defining what “reliable enough” meant and building systems to measure it.

The Career Impact

Six months after the SLO project, Sarah was promoted to Principal Engineer. The promotion feedback highlighted her:

She didn’t write much code during the SLO project. She wrote monitoring queries, dashboards, and documentation. Most importantly, she created a shared language for reliability conversations.

Key Takeaways for Staff Engineers

  1. Question implicit assumptions: “High availability” meant different things to everyone
  2. Make tradeoffs explicit: Error budgets turned invisible decisions into visible resource management
  3. Build frameworks, not one-time solutions: SLOs became reusable across teams
  4. Persuade with data: Incident analysis overcame emotional arguments
  5. Invest in infrastructure for decision-making: Good tooling changes how people think

The availability target didn’t just define the platform. It defined Sarah’s approach to staff engineering: Find the unasked questions, make the invisible visible, and build systems that help others make better decisions.