The Service Mesh That Started With a Whiteboard
The Service Mesh That Started With a Whiteboard
Maya had been a Staff Engineer at a fast-growing fintech for eighteen months when the VP of Engineering walked into her team’s space with a problem: their microservices architecture was collapsing under its own complexity. They had 200+ services, 15 different observability tools, and no consistent way to handle retries, circuit breaking, or authentication between services.
“We need a service mesh,” the VP announced. “I’m allocating two teams and six months. Maya, you’re the technical lead.”
Maya’s first instinct was to say yes. This was the kind of high-visibility infrastructure project that Staff Engineers dream about. But something made her pause.
“Give me three days,” she said. “Let me understand what we actually need first.”
The Discovery That Changed Everything
Instead of diving into service mesh evaluations, Maya did something unusual: she spent three days shadowing different engineering teams, watching how they actually built and operated services.
What she found surprised her:
- Most services didn’t talk to many other services (median was 3 dependencies)
- The “service mesh problem” was really 5 different problems experienced by different teams
- Teams were already solving these problems in inconsistent ways
- The real pain point wasn’t missing features—it was no standard way to solve common problems
On day three, Maya drew a simple diagram on a whiteboard in the main engineering area:
Our Actual Problem:
┌─────────────────────────────────────┐
│ We have standard problems │
│ but no standard solutions │
│ │
│ Results in: │
│ • 15 different retry libraries │
│ • No consistent metrics │
│ • Each team reinvents same patterns │
│ • Impossible to reason about system │
└─────────────────────────────────────┘
She left a marker and note: “What would help YOU most? Write on the whiteboard.”
By end of week, the whiteboard was covered in sticky notes. Maya took a photo and started analyzing patterns.
The Insight: Start With Standards, Not Infrastructure
Maya realized something fundamental: their organization didn’t need a service mesh. They needed agreed-upon standards for service communication, with the lightest possible infrastructure to enforce them.
She wrote a one-page doc titled “Service Communication Standards” that covered:
- How to do retries (exponential backoff, max 3 attempts)
- How to report metrics (standard labels, 99th percentile latency)
- How to authenticate service-to-service calls (short-lived tokens)
- How to handle circuit breaking (fail fast after 10 consecutive errors)
Then the critical part: “These standards can be implemented as a library, not a sidecar proxy.”
Her reasoning:
- Library = 0.1ms overhead vs 1-3ms for sidecar proxies
- Library = deploy as dependency, not new infrastructure
- Library = gradual rollout, not big-bang migration
- Library = fail in-process, not due to proxy crashes
But she knew the real challenge wasn’t technical—it was organizational.
Building Consensus Without Authority
Maya had no direct reports and no formal authority over the 200+ services. She needed buy-in from teams who were already overwhelmed and skeptical of “yet another platform initiative.”
She used three strategies:
1. Start With the Believers
She identified three teams who were actively struggling with service communication problems. She offered to pair with them for a week each, implementing the standards using a minimal library she’d prototyped.
The deal: “I’ll help solve your immediate problem. You give feedback on whether this approach works.”
All three teams shipped improvements within two weeks. More importantly, they became advocates.
2. Make Success Visible
She created a dashboard showing service communication metrics—but only for services using the new standards. Teams could see their services had blank spots on the dashboard.
She didn’t mandate adoption. She made the absence of adoption visible.
Within a month, teams were asking, “How do we get on that dashboard?”
3. Document Decisions, Not Just Code
For every design choice in the standards, Maya wrote a decision record explaining:
- What problem it solves
- Why this approach vs alternatives
- What trade-offs were made
- What circumstances might change the decision
Example decision record:
Decision: Use library, not sidecar proxy
Problem: Need consistent retry behavior across services
Why library:
- 10x lower latency overhead
- No new infrastructure to operate
- Gradual rollout possible
- Fails in-process (better debugging)
Trade-off:
- Requires code changes (not zero-touch)
- Language-specific implementations needed
- Can't enforce standards at infrastructure level
Might change if:
- We need to enforce standards for compliance
- Teams can't upgrade dependencies regularly
- Polyglot environments with 10+ languages
These decision records did two things: they built trust (people understood the reasoning) and they made it easy for others to contribute (they could see what was still open for debate).
The Moment of Resistance
Six weeks in, the infrastructure team pushed back hard. They’d been expecting to build a service mesh—installing Istio or Linkerd, becoming the “platform team.” Now Maya was proposing something that required minimal infrastructure.
“This doesn’t give us enough control,” the infrastructure lead argued. “What if teams don’t adopt the library? What if they bypass the standards?”
Maya recognized this wasn’t a technical objection—it was about team purpose and value.
She proposed a reframe: “Your team’s value isn’t controlling what teams do. It’s making the right thing easy and visible.”
Then she showed them the adoption numbers:
- 47 services using the standards library (from 0 six weeks ago)
- Average latency reduction of 23% (from removing poorly-implemented retries)
- Zero new infrastructure to operate
- 3 teams contributing improvements to the library
“You can spend six months building infrastructure that teams resist, or you can help 50 more teams adopt these standards in the next six weeks. Which adds more value?”
The infrastructure team shifted from gatekeepers to enablers. They built tooling to make adoption easier: automated pull requests, migration scripts, and dashboard integrations.
The Outcome
Twelve weeks after the VP’s original request, Maya presented results:
Metrics:
- 112 services (56%) adopted standards library
- Consistent observability across all adopted services
- P99 latency improved 31% on average
- Zero new infrastructure deployed
- 8 engineers contributed to the library
More Important:
- Teams had a standard vocabulary for service communication
- Debugging distributed issues became tractable (consistent logs/metrics)
- New services got best practices “for free” by using the library
- The organization had proven it could agree on and adopt standards
The VP was initially disappointed—this wasn’t the “big platform initiative” he’d envisioned. But when Maya showed the adoption curve and impact metrics, he understood.
“We could have spent six months building infrastructure that teams avoided,” he said. “Instead we have standards that teams actually use. That’s better.”
What Made This Staff-Level Work
Looking back, Maya identified what made this project representative of Staff Engineer impact:
1. Solved the Real Problem, Not the Stated Problem
The stated problem was “we need a service mesh.” The real problem was “we have no consistent way to solve service communication challenges.”
Maya didn’t get distracted by the solution space (which service mesh to choose). She stayed in the problem space long enough to reframe the problem itself.
2. Optimized for Adoption, Not Technical Purity
A service mesh would have been more technically complete. But it would have required:
- Months of infrastructure work before any value delivered
- Big-bang migration (high risk)
- Ongoing operational overhead
- Team training on new concepts
The library approach was technically “worse” in some ways, but infinitely better in the way that actually mattered: teams would actually use it.
3. Built Organizational Capability, Not Just Code
The library was useful, but the real output was:
- Teams learned to agree on standards
- A precedent for bottom-up adoption of practices
- Visible metrics that created social proof
- Decision records that codified organizational learning
Six months later, teams used the same pattern to standardize database access patterns. The capability to agree on and adopt standards became reusable.
4. Influenced Without Authority
Maya had zero direct reports. She couldn’t mandate adoption. Instead, she:
- Made success visible (dashboard showing compliant services)
- Made adoption easy (automated tooling, pairing sessions)
- Made the reasoning transparent (decision records)
- Made early adopters into advocates
This is the core of Staff Engineer work: multiplying impact through influence, not authority.
5. Reframed Problems for Stakeholders
When the VP wanted a service mesh, Maya could have built what was asked. Instead, she:
- Spent time understanding the real need
- Presented an alternative that met the need differently
- Showed adoption and impact data to justify the approach
- Helped stakeholders understand the trade-offs
This is pattern matching at an organizational level: recognizing that the stated solution might not solve the actual problem.
Lessons for Senior ICs
Start by understanding the problem space deeply. Maya spent three days shadowing teams before proposing solutions. This discovery phase is where Staff-level work differentiates itself from senior engineer work.
Make adoption part of the design. The best technical solution that nobody uses has zero impact. Maya designed for adoption from day one: minimal dependencies, gradual rollout, visible benefits.
Document decisions, not just code. Decision records built trust and made the reasoning transparent. This enabled others to contribute and adapt the approach.
Influence is about making the right thing easy and visible. Maya didn’t mandate adoption. She made the benefits visible (dashboard) and the adoption easy (tooling, pairing). Social proof did the rest.
Reframe problems when the stated solution doesn’t match the actual need. The courage to say “we don’t need what was asked for” is what separates order-takers from technical leaders.
Build capability, not just features. The real output wasn’t the library—it was the organization’s improved ability to agree on and adopt technical standards.
The Follow-Up
Eighteen months later, the standards library was used by 184 of 220 services. Teams contributed 50+ improvements. The infrastructure team had become enablers of standards adoption rather than gatekeepers.
More significantly, the pattern repeated: teams used the same approach to standardize database access, logging, and feature flags. The organization had learned how to learn.
Maya’s title didn’t change. She didn’t get promoted to management. But her impact multiplied through the practices and patterns that became embedded in how the organization solved problems.
That’s the work of a Staff Engineer: changing how the organization solves problems, not just solving today’s problem.