The Observability Platform That Started with a Spreadsheet
The Problem Nobody Wanted to Acknowledge
Sarah had been a Staff Engineer at a rapidly growing fintech company for six months when she noticed something troubling. Every production incident involved the same painful pattern: engineers scrambling across multiple monitoring tools, Slack channels exploding with speculation, and executives asking “when will this be fixed?” while engineers were still trying to figure out “what exactly is broken?”
The company used seven different observability tools. Logs went to three different places depending on which team owned the service. Metrics dashboards were scattered across Grafana, Datadog, and custom internal tools. Tracing existed for some services but not others. Nobody had a complete picture of anything.
When Sarah raised this in the engineering all-hands, the VP of Engineering acknowledged it was “something we should eventually tackle.” But there were always more pressing features to ship, more customer escalations to handle, more immediate fires to fight.
The Spreadsheet That Changed Everything
Sarah didn’t ask for permission to build a new observability platform. Instead, she opened a spreadsheet.
For two weeks, during every incident, she quietly documented:
- Which tools engineers checked
- How long it took to find relevant information
- How many context switches were required
- The cost of each tool per month
- Which data sources actually led to root cause identification
The spreadsheet revealed something shocking: Engineers spent an average of 47 minutes per incident just gathering context before they could begin actual debugging. For major incidents, this jumped to 2+ hours. The company was averaging 15 incidents per month.
Then came the financial analysis. The seven observability tools cost $42,000 per month combined. But the engineering time wasted on fragmented observability cost an estimated $180,000 per month in lost productivity - more than 4x the tooling cost.
Sarah didn’t present this as “we need better observability.” She presented it as “we’re burning $2.1M annually on invisible costs.”
Building Influence Without Authority
Sarah had no direct reports. She couldn’t assign anyone to work on this. She couldn’t mandate that teams adopt a new approach. But she had data, and she had a strategy.
Step 1: Find the allies
She identified three engineers who had expressed frustration with observability during recent incidents. She shared her spreadsheet privately. Each of them had experienced the pain and immediately saw themselves in the data.
“Want to prototype something better?” she asked. Not “help me build a platform,” but “want to experiment with solving this?”
Step 2: Start absurdly small
They didn’t build a platform. They built a single page - a dashboard that aggregated links to existing tools, organized by service. It took two days.
They called it “The War Room” and shared it in Slack: “Next time there’s an incident, try starting here. It won’t solve everything, but it’s one URL instead of seven.”
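For a sense of how small "absurdly small" can be: a page like this is little more than a hard-coded list of links grouped by service. The sketch below is illustrative rather than the team's actual code; the service names, URLs, and the use of Flask are all assumptions.

```python
# war_room.py - a minimal "one URL instead of seven" page (illustrative sketch).
# Service names and dashboard URLs below are placeholders, not real endpoints.
from flask import Flask

app = Flask(__name__)

# The entire "platform", v0: links to existing tools, grouped by service.
LINKS = {
    "checkout": [
        ("Grafana dashboard", "https://grafana.internal/d/checkout"),
        ("Traces", "https://tracing.internal/service/checkout"),
        ("Logs", "https://logs.internal/search?service=checkout"),
    ],
    "payments": [
        ("Grafana dashboard", "https://grafana.internal/d/payments"),
        ("Runbook", "https://wiki.internal/payments/runbook"),
    ],
}

@app.route("/")
def war_room():
    # Render one page of links; no data aggregation yet, just navigation.
    sections = []
    for service, links in sorted(LINKS.items()):
        items = "".join(f'<li><a href="{url}">{label}</a></li>' for label, url in links)
        sections.append(f"<h2>{service}</h2><ul>{items}</ul>")
    return "<h1>The War Room</h1>" + "".join(sections)
```

No storage, no auth integration, no data pipelines - just enough to change where people look first.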
The next incident happened three days later. Engineers used The War Room. It saved 20 minutes. People noticed.
Step 3: Make it easy to contribute
Sarah added a “suggest a link” feature. Other engineers started adding their teams’ dashboards. The War Room grew organically. She wasn’t building it alone anymore - she was curating contributions.
Within a month, The War Room was the default incident response starting point.
Step 4: Let the platform find you
Sarah didn’t pitch building a real observability platform. She let the limitations of The War Room do the talking. Engineers started asking: “Could The War Room show service health status directly?” “Could it automatically pull in recent deployments?” “Could it correlate logs and metrics?”
Each question was a vote for something more sophisticated.
Three months after the spreadsheet, Sarah got a Slack message from the VP of Engineering: “I want to talk about formalizing this observability work. Can we make it a real project?”
The Architecture Decision
Now Sarah faced the classic Staff Engineer dilemma: build vs buy vs integrate.
The company was already spending $42K/month on seven tools. The instinct was to consolidate to one vendor - “rip out the old, bring in the new.”
Sarah knew this was wrong.
She spent a week interviewing every engineering team about their observability needs. Backend teams cared about request traces and database query performance. Frontend teams needed real-user monitoring and error tracking. Infrastructure teams needed system metrics and cost analytics. ML teams needed model performance metrics.
No single vendor solved all these problems well. Consolidating to one would mean forcing teams into suboptimal tools, breeding resentment and workarounds.
Her proposal was counterintuitive: Keep most of the existing tools, but build a unified interface layer.
The architecture had three principles:
- Single pane of glass, not single tool - Aggregate data from existing tools rather than replacing them
- Service-centric, not tool-centric - Navigate by “what’s wrong with the checkout service?” not “let me check Datadog”
- Progressive disclosure - Show high-level health by default, drill into detailed tools only when needed
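Concretely, “service-centric with progressive disclosure” means the unit of navigation is a catalog entry per service: a coarse health status shown by default, plus links into the detailed tools for when that isn’t enough. Here is a rough sketch of that data model; the field names, thresholds, and URLs are illustrative assumptions, not the actual platform.

```python
from dataclasses import dataclass, field

@dataclass
class ServiceEntry:
    """One service in the catalog - the unit of navigation, not the tool."""
    name: str
    owner_team: str
    # Progressive disclosure: these links lead into the detailed tools,
    # used only when the high-level health view isn't enough.
    dashboards: dict[str, str] = field(default_factory=dict)

    def health(self, error_rate: float, latency_p95_ms: float) -> str:
        """Coarse status shown on the single pane of glass by default."""
        if error_rate > 0.05 or latency_p95_ms > 2000:
            return "unhealthy"
        if error_rate > 0.01 or latency_p95_ms > 1000:
            return "degraded"
        return "healthy"

# Navigation is by service ("what's wrong with checkout?"), not by tool.
# All URLs below are placeholders.
checkout = ServiceEntry(
    name="checkout",
    owner_team="payments-platform",
    dashboards={
        "metrics": "https://grafana.internal/d/checkout",
        "traces": "https://tracing.internal/service/checkout",
        "logs": "https://logs.internal/search?service=checkout",
    },
)
```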
The implementation was pragmatic:
- Use existing tools’ APIs to pull key metrics
- Build a lightweight service catalog as the navigation layer
- Create standardized “service health” views with links to detailed dashboards
- Add correlation: show deployments, incidents, and metrics on a unified timeline
Total build time: 8 weeks with a team of 3 engineers (20% time allocation)
Cost: $15K in development time + existing tool costs
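The correlation piece was the part engineers asked for most, and in spirit it is simple: heterogeneous events - deployments, incidents, metric alerts - pulled from the existing tools’ APIs and sorted onto one time axis. A hedged sketch follows; the fetch_* functions are stubs standing in for whatever each vendor’s API actually exposes, and the names are mine, not the platform’s.

```python
# timeline.py - correlate deployments, incidents, and metric alerts for one
# service. The fetch_* functions are stubs standing in for existing tools'
# APIs; the structure here is illustrative, not the real implementation.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class TimelineEvent:
    at: datetime
    kind: str        # "deploy" | "incident" | "metric_alert"
    service: str
    summary: str

def fetch_deployments(service: str, since: datetime) -> list[TimelineEvent]:
    # Stub: the real version would query the CI/CD system's API.
    return []

def fetch_incidents(service: str, since: datetime) -> list[TimelineEvent]:
    # Stub: the real version would query the incident tracker.
    return []

def fetch_metric_alerts(service: str, since: datetime) -> list[TimelineEvent]:
    # Stub: the real version would query the metrics vendor's alert API.
    return []

def unified_timeline(service: str, since: datetime) -> list[TimelineEvent]:
    """The correlation view: heterogeneous events, sorted onto one axis."""
    events = (
        fetch_deployments(service, since)
        + fetch_incidents(service, since)
        + fetch_metric_alerts(service, since)
    )
    # The whole trick is ordering, so a responder can see
    # "deploy at 14:02, error-rate alert at 14:07" at a glance.
    return sorted(events, key=lambda e: e.at)

if __name__ == "__main__":
    for event in unified_timeline("checkout", datetime.now() - timedelta(hours=6)):
        print(f"{event.at:%H:%M}  [{event.kind}] {event.summary}")
```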
The Results
Six months after launch:
- Mean time to context (MTTC) dropped from 47 minutes to 8 minutes
- Incident resolution time decreased by 35% on average
- Tool consolidation: eliminated 2 redundant tools, saving $14K/month
- Adoption: 94% of incidents started with the observability platform
- Cultural shift: teams started instrumenting services better because good instrumentation now had visible payoff
But the most important outcome was invisible: engineering confidence.
Engineers stopped dreading incidents. They had a system they trusted. New engineers could respond to production issues within their first week instead of needing months to learn the monitoring landscape.
The Lessons
Sarah’s journey from spreadsheet to platform teaches several lessons about Staff Engineer impact:
1. Make the invisible visible
Most organizational problems hide in the gaps between systems. Spreadsheets, lightweight dashboards, and simple automation can illuminate these gaps faster than grand proposals.
2. Influence follows proof
Sarah didn’t get buy-in for an observability platform. She built momentum with The War Room, and buy-in found her. Showing > Telling.
3. Start with the user journey, not the architecture
The War Room succeeded because it solved the incident responder’s journey (“where do I look?”), not because it had elegant architecture. The sophisticated platform came later, informed by real usage.
4. Integration beats replacement
Staff Engineers inherit existing technical landscapes. The instinct is to start over with a clean slate. But meeting teams where they are - integrating rather than replacing - reduces friction and increases adoption.
5. Track leading indicators
Sarah tracked time-to-context, not just time-to-resolution. Leading indicators (how long to gather information) reveal bottlenecks that lagging indicators (total incident time) obscure.
6. Build with, not for
By making The War Room contribution-friendly from day one, Sarah created ownership across teams. The platform was “ours” not “Sarah’s thing.”
The Career Growth Angle
This project was Sarah’s promotion case to Principal Engineer.
Not because she built impressive technology, but because she demonstrated:
- Strategic vision: Saw the observability gap and its business impact
- Organizational awareness: Navigated without authority through data, prototypes, and coalition-building
- Technical judgment: Made pragmatic build-vs-buy decisions
- Execution: Delivered measurable business outcomes
- Leadership: Created a contribution model that scaled beyond herself
She didn’t wait for permission to lead. She identified a multiplier problem - something that made everyone else more effective - and systematically solved it.
The Bottom Line
Staff Engineers don’t succeed by building the most technically sophisticated systems. They succeed by identifying the leverage points where technical solutions unlock organizational effectiveness.
Sarah’s spreadsheet became a platform. But it started as curiosity about why incidents felt so chaotic, and a willingness to measure the invisible tax of fragmentation.
The best Staff Engineer work often starts not with a design document, but with a question: “Why does this feel harder than it should be?”