The Observability Migration That Taught Me Scope
The Setup
Maria had been a Staff Engineer at a fintech company for six months when the CTO approached her with what seemed like a straightforward technical project: migrate the company’s observability stack from a legacy system to a modern platform. The company was spending $400K annually on its current solution and losing data during high-traffic periods, and engineers complained daily about poor query performance.
“We need someone senior to own this,” the CTO said. “You’ve done infrastructure work before. Should take a quarter, maybe two.”
Maria accepted immediately. It felt like the perfect Staff-level project: high impact, visible to leadership, technically interesting. She’d migrated monitoring systems at her previous company. How different could this be?
Very different, it turned out.
The First Month: Scope Explosion
Maria started by doing what she’d always done: diving into technical research. She evaluated vendors, built proofs of concept, benchmarked performance, and designed a migration architecture. By week three, she had a 40-page design document outlining a phased migration approach.
Then reality hit.
The security team wanted encryption at rest and strict access controls. The compliance team needed audit logs for financial regulations. The platform team was in the middle of a Kubernetes migration and couldn’t support another infrastructure change. The application teams were each using different logging libraries, metric formats, and tracing approaches. One team had built a custom dashboarding system on top of the current tool that would break completely with the migration.
What Maria thought was a technical project became an organizational puzzle with a dozen moving pieces, each controlled by different stakeholders with different priorities.
“I spent my entire first month talking to people,” Maria recalls. “I barely wrote any code. My old senior engineer self would have thought I wasn’t making progress. But I was learning that at Staff level, understanding the problem IS the progress.”
The Pivot: From Solution to Strategy
Maria’s breakthrough came during a conversation with another Staff Engineer, James, who asked her a simple question: “What are you actually trying to solve?”
“The observability migration,” Maria answered automatically.
“No,” James pushed back. “Why does that matter? What’s the real problem?”
Maria paused. The real problems were:
- Engineers couldn’t debug production issues quickly
- The system couldn’t handle traffic spikes without losing data
- The cost was eating into the infrastructure budget
- Each team had invented their own observability approach, making cross-team debugging impossible
“So the migration is a solution, not the problem,” James said. “What if there are other solutions?”
This reframed everything.
Maria went back to stakeholders with a different conversation. Instead of “Here’s how we’ll migrate to the new platform,” she asked “What observability problems prevent you from shipping features confidently?”
The answers revealed something surprising: most teams didn’t care about the migration. They cared about:
- Faster time to debug production issues
- Better alerting to catch problems before customers noticed
- Visibility into distributed transactions across services
- Reducing the learning curve for new engineers
Some of these problems could be solved without any migration at all.
The Scope Decision
Maria faced a critical choice that defines Staff-level work: should she solve the problem she was asked to solve, or the problem she discovered?
She wrote a new document - much shorter than her first - outlining three approaches:
Option 1: Full Migration (Original Plan)
- Timeline: 9-12 months
- Cost: $200K in engineering time + $300K annual platform cost
- Risk: High - touches all teams, requires coordination
- Impact: Solves cost and scalability, may not solve usability
Option 2: Incremental Improvement (New Approach)
- Standardize logging/metrics libraries across teams (3 months)
- Implement distributed tracing for top 10 critical flows (2 months)
- Build shared dashboards and runbooks (ongoing)
- Migrate only the bottleneck components (4 months)
- Timeline: 9 months phased rollout
- Cost: $100K engineering time + keep current platform
- Risk: Medium - gradual adoption, less disruption
- Impact: Solves immediate pain points, defers cost problem
Option 3: Hybrid Approach (Recommended)
- Standardize instrumentation (months 1-3)
- Migrate metrics only to new platform (months 3-5)
- Keep logs on current platform with improved retention policies
- Add distributed tracing as new system (months 4-6)
- Evaluate full migration after 6 months of data
- Timeline: 6 months to major improvements, revisit full migration later
- Cost: $120K engineering time + $250K annual platform cost
- Risk: Low - validates approach incrementally
- Impact: Addresses 80% of pain points at 40% of the effort
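The cost figures in the three options can be laid side by side with a little arithmetic. The sketch below is illustrative and not part of Maria's document. It assumes the current platform costs the $400K a year quoted in the setup, treats first-year cost as engineering time plus one year of platform fees, and ignores any overlap period where both platforms run at once. The `Option` class and its method names are hypothetical.

```python
from dataclasses import dataclass

# Assumption: annual cost of the current platform, taken from the setup section.
CURRENT_PLATFORM_COST = 400_000


@dataclass(frozen=True)
class Option:
    """One of the three approaches, using the dollar figures listed above."""
    name: str
    engineering_cost: int       # one-time engineering time, in dollars
    annual_platform_cost: int   # recurring platform cost, in dollars per year

    def first_year_cost(self) -> int:
        # Simplification: one-time engineering cost plus one year of platform fees.
        return self.engineering_cost + self.annual_platform_cost

    def platform_savings_pct(self) -> float:
        # Reduction in annual platform spend relative to the current platform.
        saved = CURRENT_PLATFORM_COST - self.annual_platform_cost
        return 100 * saved / CURRENT_PLATFORM_COST


OPTIONS = [
    Option("Option 1: Full Migration", 200_000, 300_000),
    # "Keep current platform" is modeled as the assumed $400K a year.
    Option("Option 2: Incremental Improvement", 100_000, CURRENT_PLATFORM_COST),
    Option("Option 3: Hybrid Approach", 120_000, 250_000),
]

for option in OPTIONS:
    print(
        f"{option.name}: first-year cost ${option.first_year_cost():,}, "
        f"platform savings {option.platform_savings_pct():.1f}%"
    )
```

Under those assumptions the hybrid option comes to $370K in year one against $500K for either alternative, and it trims annual platform spend by 37.5%.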
Maria recommended Option 3, explicitly calling out what she was NOT doing: a complete rip-and-replace migration.
The Lesson in Saying No
The CTO’s response surprised her: “This is exactly the kind of thinking I hired you for. A senior engineer would have just done the migration because that’s what I asked for. You figured out what I actually needed.”
But not everyone was happy.
The vendor they’d been evaluating for the full migration had promised a 50% cost reduction. The infrastructure team had already started planning for the new platform. One engineering manager complained that Maria was “scaling back” the project and “lowering her ambition.”
Maria had to say no to good ideas:
- No to the vendor’s advanced features that would be great but weren’t critical
- No to migrating everything at once, even though it would be “cleaner”
- No to building a perfect unified observability platform, even though it would be impressive
“The hardest part of Staff engineering isn’t technical complexity,” Maria reflects. “It’s scoping work to maximize impact with finite resources. Anyone can make things more complex. The skill is making them simpler.”
The Results
Six months later:
- Mean time to detection for critical bugs dropped from 45 minutes to 8 minutes
- 90% of production debugging happened through standardized dashboards
- New engineers could navigate observability tools after a single onboarding session
- Costs decreased by 35% through better retention policies and selective migration
- Zero production incidents caused by the observability changes themselves
But the most important metric was invisible: the engineering hours saved. Maria’s approach freed up three months of engineering time compared to the full migration, allowing teams to focus on product features instead of infrastructure migration.
Key Takeaways for Staff Engineers
1. Reframe Solutions as Problems
When given a project, ask “why does this matter?” until you understand the root problem. Often the best solution is different from the one initially proposed.
2. Scope is a Feature, Not a Compromise
Reducing scope isn’t “thinking small.” It’s being strategic about where to apply finite resources. The best Staff engineers are surgeons, not bulldozers.
3. The Power of “Not Yet”
Maria didn’t say no to the full migration forever - she said “not yet, let’s validate the approach first.” This preserved optionality while reducing risk.
4. Make the Implicit Explicit
Writing down what you’re NOT doing is as important as what you ARE doing. It prevents scope creep and aligns expectations.
5. Influence Requires Alternatives
Saying “we shouldn’t do this” without offering alternatives is just complaining. Saying “instead of X, we could do Y or Z” is leadership.
6. Progress Isn’t Always Code
Maria’s most valuable work in month one was talking to stakeholders, not writing code. Senior ICs sometimes struggle with this - it feels less “real” than technical work. But understanding the landscape IS technical work at Staff level.
7. Finite Time Framework
Maria couldn’t be the one implementing every piece of the solution. She defined the strategy, created standards, and empowered teams to execute within those guardrails. She reviewed rather than wrote. She unblocked rather than built.
The Career Growth Insight
This project became Maria’s promotion case study when she advanced to Senior Staff Engineer a year later. Not because of the technical brilliance of the observability architecture, but because she demonstrated the judgment to:
- Reframe a technical project as an organizational challenge
- Scope work to maximize impact with minimum disruption
- Navigate stakeholder conflicts and build consensus
- Say no strategically while offering better alternatives
- Deliver meaningful results in six months, against the 9-12 months estimated for the full migration
“I thought Staff engineering was about being the best technical problem-solver,” Maria says. “It’s actually about being the best problem-finder and scope-setter. The technical execution is table stakes. The differentiation is knowing which problems to solve and which to defer.”
Questions for Reflection
For engineers navigating the Staff+ path:
- When given a project, do you start with technical research or stakeholder discovery?
- Can you articulate what you’re explicitly NOT doing in your current project?
- Are you solving the problem you were asked to solve, or the one that actually matters?
- How much of your time is spent understanding versus executing?
- When was the last time you recommended reducing scope to increase impact?
The answers reveal whether you’re thinking like a senior engineer or a Staff engineer. Both are valuable, but they’re fundamentally different jobs.