The Monitoring Dashboard That Nobody Wanted
Sarah Chen had been at the payments company for three years when she got promoted to Staff Engineer. Her first major initiative seemed straightforward: build a unified observability dashboard for the platform team. The engineering director loved the idea. The VP of Engineering approved the budget. Sarah assembled a small team and set a six-month timeline.
Nine months later, the dashboard was technically flawless—and completely unused.
This is the story of how Sarah learned that Staff Engineering isn’t about building the best technical solution. It’s about solving the right problem.
The Perfect Solution to the Wrong Problem
The dashboard was beautiful. Real-time metrics, distributed tracing, log aggregation, anomaly detection powered by ML—it had everything. Sarah’s team had used modern tech: React frontend, GraphQL API, Prometheus and Grafana integration, custom alerting engine.
But two months after launch, usage analytics told a brutal truth:
- 12% weekly active users (mostly Sarah’s team)
- Average session time: 43 seconds
- Zero alerts configured by product teams
- No adoption by the SRE team, who were the intended primary users
The director who championed the project had moved to another company. The new leadership questioned why they were spending $40K/month on infrastructure nobody used.
The Uncomfortable Truth
Sarah did what she should have done nine months earlier: she talked to the people who were supposed to use the dashboard.
The SRE team: “We already have dashboards. They’re not pretty, but we know where everything is. Learning a new tool would slow us down during incidents.”
Product engineers: “This looks complicated. When there’s an issue, I just grep the logs or ping the SRE team. It’s faster.”
The platform team: “Honestly, I didn’t know this existed. Nobody told us we should be using it.”
Sarah had assumed that because observability was a problem in theory, her solution would be adopted in practice. She had built what she thought people needed rather than discovering what they actually needed.
The Pivot: From Solution to Problem
Instead of defending the dashboard or trying to force adoption, Sarah made a different choice. She shelved the project and spent the next two weeks just observing:
- She joined the SRE on-call rotation to see how they actually debugged issues
- She sat in on incident reviews and watched what tools people reached for
- She interviewed engineers about their most painful debugging experiences
- She analyzed which existing monitoring tools got the most usage
What she discovered surprised her:
The real problem wasn’t lack of visibility—it was time to resolution during incidents.
The existing monitoring tools actually provided decent visibility. The problem was that during a production incident at 2 AM, engineers were context-switching between 7 different tools, correlation was manual, and finding the root cause took too long.
Engineers didn’t need another dashboard. They needed faster time to answers during high-stress situations.
The Un-Dashboard
Sarah’s pivot was radical. Instead of a comprehensive dashboard, she built something much smaller: an incident command interface.
Key decisions:
- Don’t replace existing tools: Integrate with them. Pull data from existing Grafana dashboards, PagerDuty, CloudWatch, and application logs.
- Optimize for incidents, not monitoring: The interface only activated during active incidents. It wasn’t for everyday monitoring.
- Zero setup required: It automatically detected services and dependencies using existing service mesh metadata. No configuration needed.
- AI-assisted correlation: When an alert fired, it automatically correlated related metrics, recent deployments, and similar past incidents.
- Slack-first interface: Instead of a web UI, most functionality was accessible via Slack commands during incidents. Engineers didn’t need to context-switch.
The entire MVP took six weeks to build—a fraction of the original dashboard effort.
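To make the correlation idea concrete, here is a minimal sketch of one piece of it: given an alert, list the recent deployments to the alerting service or its dependencies. It is not Sarah’s implementation. The article calls the feature AI-assisted, and this sketch swaps in a plain time-window heuristic. The `Alert` and `Deployment` types, the `dependencies` mapping, and the one-hour window are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deployment:
    # Hypothetical record of a deployment event.
    service: str
    deployed_at: datetime


@dataclass(frozen=True)
class Alert:
    # Hypothetical record of a fired alert.
    service: str
    fired_at: datetime


def recent_deployments(alert, deployments, dependencies, window=timedelta(hours=1)):
    """Return deployments to the alerting service or its direct dependencies
    that landed within `window` before the alert, most recent first.

    `dependencies` maps a service name to the services it calls; in the
    article this information came from service mesh metadata.
    """
    candidates = {alert.service} | set(dependencies.get(alert.service, ()))
    hits = [
        d for d in deployments
        if d.service in candidates
        # Keep only deployments at or before the alert, inside the window.
        and timedelta(0) <= alert.fired_at - d.deployed_at <= window
    ]
    return sorted(hits, key=lambda d: d.deployed_at, reverse=True)
```

A Slack command handler could call `recent_deployments` when an incident opens and post the result in the incident channel, so the first suspects are visible without leaving the conversation.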
The Results
Three months after launching the incident command interface:
- Mean time to resolution (MTTR) decreased by 35%
- 78% of engineering teams had used it during incidents
- SRE team made it the default starting point for incident response
- Infrastructure costs: $3K/month (vs. $40K/month for the unused dashboard)
More importantly, Sarah learned what distinguished Staff Engineering from senior engineering.
Lessons for Staff Engineers
1. Fall in Love with the Problem, Not Your Solution
Sarah’s initial dashboard was a solution looking for a problem. She had assumed the problem was “we need better observability dashboards” when the real problem was “incidents take too long to resolve.”
Key practice: Spend at least 20% of project time on problem validation before writing code. For every hour of design, spend 30 minutes interviewing users.
2. Adoption is a Feature, Not an Afterthought
Technical excellence doesn’t matter if nobody uses what you build. Sarah’s original dashboard was technically superior to the scrappy incident interface—but the interface solved a problem people felt acutely.
Key practice: Define success metrics that include adoption and usage, not just technical metrics. “System handles 10K RPS” is meaningless if nobody sends requests to it.
3. Work with the Grain, Not Against It
Sarah’s first instinct was to replace the existing monitoring tools because they were “suboptimal.” The successful approach integrated with existing tools and workflows.
Key practice: Understand the path of least resistance for adoption. What requires zero behavior change? What leverages existing habits?
4. Influence Requires Presence
Sarah built the original dashboard somewhat in isolation. The successful pivot happened after she embedded herself in the SRE on-call rotation and incident response.
Key practice: You can’t influence from a distance. Get close to the problem. Experience the pain firsthand.
5. Know When to Kill Your Darlings
The hardest moment was admitting the dashboard was a failure. Sarah could have tried to force adoption, added more features, or blamed “change resistance.”
Key practice: Set clear kill criteria before starting projects. If adoption hasn’t reached X% after Y weeks, pivot or kill. Sunk cost is not a strategy.
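One way to keep kill criteria honest is to write them down as data before launch and evaluate them mechanically at review time. The sketch below assumes adoption is measured as active users divided by eligible users. The `KillCriteria` type, the `review` function, and the returned labels are hypothetical. The thresholds stay as parameters because the right X and Y depend on the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KillCriteria:
    min_adoption_pct: float   # the "X%" threshold, agreed before launch
    review_after_weeks: int   # the "Y weeks" deadline, agreed before launch


def review(criteria, weeks_since_launch, active_users, eligible_users):
    """Return 'too early', 'continue', or 'pivot or kill'."""
    if eligible_users <= 0:
        raise ValueError("eligible_users must be positive")
    if weeks_since_launch < criteria.review_after_weeks:
        # The deadline has not passed yet, so no verdict is issued.
        return "too early"
    adoption_pct = 100.0 * active_users / eligible_users
    if adoption_pct >= criteria.min_adoption_pct:
        return "continue"
    return "pivot or kill"


# Placeholder thresholds for illustration only, not a recommendation.
criteria = KillCriteria(min_adoption_pct=30.0, review_after_weeks=8)
print(review(criteria, weeks_since_launch=8, active_users=12, eligible_users=100))
```

Fixing the thresholds in advance is the point: once the review date arrives, the decision follows from the numbers rather than from how much has already been invested.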
6. Small Wins Build Credibility
The incident command interface succeeded partly because it was narrowly scoped. It didn’t try to solve all observability problems—just one critical workflow.
Key practice: Especially when recovering from a failed project, choose a small scope with clear value. Build credibility through wins, then expand.
The Meta-Lesson
A year after the incident command interface launch, Sarah reflected on what made the difference. The original dashboard was technically more impressive. It would have looked better on a resume. It scratched more engineering itches.
But Staff Engineering isn’t about impressive technical artifacts. It’s about impact.
The dashboard nobody wanted taught Sarah that her job wasn’t to build the most sophisticated system. It was to understand the problem deeply enough to build the simplest thing that would actually get used.
That shift, from “what’s the coolest thing I could build?” to “what’s the smallest thing that would have the biggest impact?”, is the essence of the Staff Engineer role.
Questions to Ask Yourself
If you’re a Staff Engineer or aspiring to be one, consider:
Are you in love with your solution or the problem? Can you articulate the problem without referencing your solution?
Have you validated the problem with actual users? Not “would you use this if it existed?” but “walk me through the last time you encountered this problem.”
What would adoption look like? Not theoretical usage, but actual behavior change. What would people stop doing? What would they start doing?
What’s the smallest viable version? Not MVP as in “missing features,” but the minimum that would create value someone would actually pay for with their time.
What are your kill criteria? If this isn’t working, how will you know? What would cause you to pivot?
The monitoring dashboard that nobody wanted became the incident command interface that everyone needed. But only after Sarah learned that building the right thing badly is better than building the wrong thing perfectly.