PipelineOps

The Wrong SLI Almost Broke Our Reliability Culture

At a quarterly review, the head of product asked me: "Your SRE dashboards are all green. So why is the support queue constantly on fire?" I didn't have a good answer.

TL;DR: We built an availability SLI on HTTP 5xx rates. The SLO stayed green for quarters, yet user experience kept getting worse. But the deeper problem wasn't the SLI definition. It was that we were using SLO as a team report card. When the numbers diverged from reality, trust across the organization collapsed. Rebuilding the SLI around user journeys wasn't enough — we had to change how SLO was used, from "grade" to "decision trigger," before things actually improved.

What I Was Trying to Do

I was running the Platform team for a B2B SaaS product and introducing availability SLOs for the first time. This was before the era of dashboards full of red gauges, and our first SLI was a simple, textbook definition:

Availability SLI: the fraction of requests that did not return 5xx
SLO: ≥ 99.9% over a 28-day rolling window

Nearly identical to what Google's SRE book lays out. We computed the metric from load balancer access logs. We spent three months building out an error budget policy and shared the numbers with each service owner.
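As a rough illustration of that first definition, here is a minimal sketch that assumes each access-log entry has already been reduced to its HTTP status code. The function name and the sample data are hypothetical, not our production pipeline.

```python
from typing import Iterable


def availability_sli(status_codes: Iterable[int]) -> float:
    """Fraction of requests that did not return a 5xx status."""
    codes = list(status_codes)
    if not codes:
        return 1.0  # assumption: no traffic counts as fully available
    good = sum(1 for code in codes if not 500 <= code <= 599)
    return good / len(codes)


# Hypothetical sample: 9,995 successful requests and 5 server errors.
sample = [200] * 9995 + [503] * 5
sli = availability_sli(sample)
print(f"SLI = {sli:.2%}, SLO met: {sli >= 0.999}")
```

Note how little the function looks at: one integer per request. Everything the user actually experienced is outside its view.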

The first two quarters looked healthy. SLO met, error budget in the black, dashboards green. In review meetings, I could confidently report: "reliability is stable."

What Went Wrong (and Why)

Halfway into the third quarter, something strange started happening.

  • Support tickets were up 40% quarter-over-quarter
  • Several large customers were signaling they might churn
  • "The product feels unstable lately" escalations were reaching the executive level

At that quarter's review, the head of product asked the question: why do the SRE dashboards say everything is fine while reality says otherwise?

I was skeptical at first. The SLO was green. By our measurements, nothing was broken. I pushed back — the support issues were probably UX, documentation, onboarding. Not reliability.

It took a few weeks of reading support tickets to see just how far my SLI had drifted from reality.

Cause 1: 200s that were actually errors

The frontend was an SPA. The backend returned JSON. When the backend threw an exception internally, a handful of endpoints would return 200 OK with a body like {"status": "error", "message": "..."}. In the browser, this showed up as "the screen went blank" or "the button does nothing."

That's a response-design problem on its own, and it's fixable if you use status codes correctly. But the point isn't that the design was broken. The real issue is that anchoring the SLI to "what status code did the server send" makes the SLI fragile. There's no guarantee that the server's view and the user's view agree.
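To make the gap concrete, here is a small sketch comparing a status-only check with one that also inspects the response body. It assumes the error envelope shown above; the function names and the message text are made up for illustration.

```python
import json


def is_success_by_status(status: int, body: str) -> bool:
    """Server view: anything that is not a 5xx counts as good."""
    return not 500 <= status <= 599


def is_success_by_payload(status: int, body: str) -> bool:
    """Closer to the user view: an error envelope is a failure even on 200."""
    if 500 <= status <= 599:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return True  # non-JSON body: this sketch cannot judge it
    return not (isinstance(payload, dict) and payload.get("status") == "error")


# Hypothetical response from one of the misbehaving endpoints.
response = (200, '{"status": "error", "message": "internal failure"}')
print(is_success_by_status(*response))   # True
print(is_success_by_payload(*response))  # False
```

Body inspection is only a patch for this one failure mode. It still depends on the server describing its own failure honestly.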

Cause 2: Slow is down

The SLI had no notion of latency. A request that took 30 seconds to return 200 still counted as a success. In reality, the browser had already timed out — to the user, the product was broken.

This bit us hardest on a search API. The day the DB query planner went sideways, p95 latency climbed past 8 seconds. The SLI did not blink.
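A minimal sketch of that blind spot, using invented traffic in which every request returns 200 but a tenth of them take nine seconds. The nearest-rank percentile helper is a simple stand-in for whatever your metrics backend computes.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100 (values must be non-empty)."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical bad day: (status, latency in seconds) per request.
requests = [(200, 0.3)] * 90 + [(200, 9.0)] * 10

status_sli = sum(1 for status, _ in requests if status < 500) / len(requests)
p95 = percentile([latency for _, latency in requests], 95)

print(status_sli)  # 1.0
print(p95)         # 9.0
```

The status-based SLI reports a perfect score on the same data where p95 is far past any tolerable wait.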

Cause 3: Auth failures were 4xx

An SSO provider had a bad day, and legitimate users kept getting 401 Unauthorized. From the application's perspective, 401 was correct — the app was blocking unauthenticated users, exactly as designed. But to the user, "I can't log in" is indistinguishable from "the site is down."

The SLI excluded 4xx from the denominator, treating them as user error. Result: the SSO outage barely registered on the SLI.
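Here is a sketch of why the outage was invisible, assuming the exclusion works exactly as described: 4xx responses are dropped before the ratio is computed. The traffic numbers are invented.

```python
def sli_excluding_4xx(status_codes):
    """Old-style SLI: 4xx responses are dropped from the denominator as 'user error'."""
    counted = [code for code in status_codes if not 400 <= code <= 499]
    if not counted:
        return 1.0  # assumption: nothing left to judge counts as available
    good = sum(1 for code in counted if not 500 <= code <= 599)
    return good / len(counted)


normal_day = [200] * 1000
sso_outage = [200] * 600 + [401] * 400  # hypothetical: 40% of requests rejected at login

print(sli_excluding_4xx(normal_day))  # 1.0
print(sli_excluding_4xx(sso_outage))  # 1.0
```

Both days score identically, because the requests that failed are precisely the ones the definition throws away.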

Cause 4: Dependency failures were invisible

The payment processor, the email service, third-party APIs. When they failed, features stopped working. But our servers kept returning 200, while the UI displayed a "processing your payment" spinner that would never resolve.

We were looking only at server-side metrics, so dependency-caused outages did not show up in the SLI at all.

The worst side effect: SLO had become a "report card"

The technical holes were bad, but the thing that really hurt was what the bad SLI was doing to the organization. And once I dug into it, the SLI definition was only the surface. Underneath was a deeper problem: how we were using SLO.

SLO is supposed to be a decision trigger. "We've burned the error budget, so we freeze features and invest in reliability." "We have budget to spare, so we can take more risk." It's a threshold for making calls. But in our org, SLO had turned into something closer to the SRE team's grade. Hitting SLO in the quarterly review was the goal. If we hit it, the reliability conversation was over.
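The "decision trigger" idea can be sketched as a function from budget state to a call. The two outcomes come from the examples above; the cutoff at zero remaining budget and the request counts are assumptions for illustration, not a recommended policy.

```python
def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the window's error budget still unspent; negative if overspent.

    Assumes slo < 1.0 and total > 0.
    """
    allowed_bad = (1.0 - slo) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def decision(remaining: float) -> str:
    """Map budget state to the call it should trigger (hypothetical policy)."""
    if remaining <= 0.0:
        return "freeze features and invest in reliability"
    return "budget to spare: more risk is acceptable"


# Hypothetical 28-day windows against a 99.9% SLO.
for good, total in [(999_500, 1_000_000), (998_000, 1_000_000)]:
    remaining = budget_remaining(0.999, good, total)
    print(f"{remaining:+.0%} budget left -> {decision(remaining)}")
```

Nothing in the output is a grade. Each line is an instruction about what to do next.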

That shape quietly creates an incentive problem. When the numbers stop matching reality, there's pressure to protect the numbers rather than confront the reality. Gradually, "the SLO says we're fine" became a legitimate response to "customers are saying it's broken." The green dashboards became a running joke. SRE's voice in reliability discussions lost weight. In code reviews, people started saying things like "just return 200 so it doesn't tank the SLI." Structural improvements got deprioritized "because the SLO is green."

The dashboards stayed green. The reliability culture was quietly falling apart. Fixing the SLI definition wouldn't be enough. We had to change how the SLO was used.

The Fix — Step by Step

Step 0: Change how SLO is used

Before any technical change, we had to get agreement that SLO was a decision trigger, not a report card. We wrote down three principles:

  1. Meeting SLO means "keep doing what we're doing" — not "we're fine." Green SLO does not close the reliability conversation.
  2. SLO is not a team performance metric. Error budget burn drives prioritization for the team. It does not feed individual reviews or department KPIs.
  3. If the SLI fails to catch situations where users are clearly suffering, that counts as an SLI definition bug. A green SLO is not proof of good user experience.

Without these three locked in across SRE, product, and leadership, fixing the SLI would just reset the same trap. The SLI is a technical artifact. "What SLO is for" is an organizational one.

Step 1: Replace "server view" with "user journey view"

We narrowed in on the three most important Critical User Journeys (CUJs) for the product:

  1. Login: a user signs in (email or SSO) and lands on the dashboard
  2. Core action: the central workflow (data ingest + aggregation, in our case) completes
  3. Settings persistence: a change made in admin is saved and reflected

We ran all three via synthetic monitoring (something like Datadog Synthetics or Checkly) every minute, measuring whether a real browser could complete the journey end-to-end.
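The shape of a journey check can be sketched without tying it to any vendor. This version uses stub functions where a real check would drive a browser; the step names, `JourneyResult`, and `run_journey` are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class JourneyResult:
    journey: str
    succeeded: bool
    duration_s: float
    failed_step: Optional[str] = None


def run_journey(name: str, steps: Sequence[Tuple[str, Callable[[], bool]]]) -> JourneyResult:
    """Run steps in order; the journey fails at the first step that returns False or raises."""
    start = time.monotonic()
    for step_name, step in steps:
        try:
            ok = step()
        except Exception:
            ok = False
        if not ok:
            return JourneyResult(name, False, time.monotonic() - start, step_name)
    return JourneyResult(name, True, time.monotonic() - start)


# Stub steps standing in for real browser actions.
login_steps = [
    ("submit credentials", lambda: True),
    ("land on dashboard", lambda: False),  # simulated failure
]
result = run_journey("login", login_steps)
print(result.succeeded, result.failed_step)
```

The journey is judged as a whole: a login that authenticates but never reaches the dashboard is a failed login.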

Step 2: Layer in Real User Monitoring (RUM)

Synthetic monitoring can only see what you script. To capture what actual users experience, we added RUM.

  • Page load time, p75 and p95
  • JS error rate
  • API error rate (network layer, from the client's perspective)

It took the combination of synthetic + RUM to get a usable picture of "availability as the user experiences it."

Step 3: Treat slow as down

We folded a latency threshold into the SLI itself:

Availability SLI (new): fraction of Critical User Journeys that completed successfully within 5 seconds (synthetic + RUM)

Five seconds came out of a conversation with product — their UX research showed that past five seconds, users close the tab.
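A minimal sketch of the new definition, assuming synthetic and RUM results have been normalized into one record shape. `JourneySample` and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

LATENCY_THRESHOLD_S = 5.0  # the threshold agreed with product


@dataclass
class JourneySample:
    journey: str
    succeeded: bool
    duration_s: float
    source: str  # "synthetic" or "rum"


def journey_sli(samples: Sequence[JourneySample],
                threshold_s: float = LATENCY_THRESHOLD_S) -> float:
    """Fraction of journey attempts that succeeded within the latency threshold."""
    if not samples:
        return 1.0  # assumption: no attempts counts as fully available
    good = sum(1 for s in samples if s.succeeded and s.duration_s <= threshold_s)
    return good / len(samples)


samples = [
    JourneySample("login", True, 1.2, "synthetic"),
    JourneySample("core action", True, 8.0, "rum"),   # completed, but too slow
    JourneySample("core action", False, 0.4, "rum"),  # failed outright
    JourneySample("settings persistence", True, 4.9, "synthetic"),
]
print(journey_sli(samples))  # 0.5
```

The slow-but-successful attempt and the outright failure land in the same bucket, which is the point of treating slow as down.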

Step 4: Count dependency failures

When an external dependency (payments, email, SSO, third-party API) caused a user journey to fail, we counted it as a failure in the SLI. "It wasn't our fault" doesn't reach the user.

We still tagged these as external in post-mortems so we could track them separately — but as far as the SLI was concerned, a failed journey was a failed journey.
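One way to keep both views is to carry a cause tag on each failed journey and ignore it when computing the SLI. The tag names and records below are invented for illustration.

```python
from collections import Counter

# Hypothetical journey attempts; "cause" is only filled in for failures.
attempts = [
    {"journey": "core action", "ok": True, "cause": None},
    {"journey": "core action", "ok": False, "cause": "external:payments"},
    {"journey": "login", "ok": False, "cause": "external:sso"},
    {"journey": "login", "ok": False, "cause": "internal"},
    {"journey": "settings persistence", "ok": True, "cause": None},
]

# The SLI ignores the cause: a failed journey is a failed journey.
sli = sum(1 for a in attempts if a["ok"]) / len(attempts)

# The breakdown exists for post-mortems, not for adjusting the SLI.
failures_by_cause = Counter(a["cause"] for a in attempts if not a["ok"])

print(sli)  # 0.4
print(dict(failures_by_cause))
```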

Step 5: Recalibrate the SLO

The new SLI was harsh. In the first month, recomputed on synthetic-plus-RUM data, availability dropped from "99.95%" to 98.6%.

Leaving the SLO at 99.9% would have meant a permanently negative error budget, freezing all feature work. Not realistic.
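The arithmetic behind "permanently negative" is short enough to check directly. This sketch only uses the two figures above; the function name is illustrative.

```python
def budget_consumption(slo: float, measured: float) -> float:
    """Observed failure rate as a multiple of the failure rate the SLO allows."""
    return (1.0 - measured) / (1.0 - slo)


multiple = budget_consumption(slo=0.999, measured=0.986)
print(f"{multiple:.1f}x the error budget")
```

A 99.9% target allows 0.1% of journeys to fail, while 98.6% measured means 1.4% are failing, so the budget is being spent roughly fourteen times over.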

SRE, product, and leadership agreed on a three-month target of 99.0%, with a plan to raise it each quarter. Not "make the number easier," but "start from what reality actually says, and build an improvement plan from there."

Step 6: Rebuild the trust you lost

This step took the longest.

  • SRE joined a weekly support sync to match support trends against the new SLI
  • In monthly reviews, we walked through which user journeys had failed and how often, with product, support, and SRE in the same room
  • We agreed internally to stop using "the SLO is green" as a way to end an argument

The technical work took weeks. The cultural repair took about six months before product told me, unprompted, that they trusted the new numbers.

What I'd Do Differently

Design for user view from day one. Server-side metrics are easy, which is why they're tempting for a first SLI. But starting from user journeys, and adding synthetic plus RUM as part of the original design, is cheaper than retrofitting it after you've lost credibility.

A good SLI hurts when things actually hurt. Our first SLI was painless. An SLI that never fires when users are clearly suffering isn't useful. When we designed the new one, our validation question was: "does this SLI catch the last three real incidents?" If the answer was no, we kept redesigning.
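That validation question can be run as a small backtest. The sketch below assumes you can replay each candidate SLI over the time window of a past incident; the readings are invented, and the incident names simply reuse the failure modes described earlier.

```python
def catches(sli_during_incident, slo):
    """An incident counts as caught if the SLI fell below the SLO target at any point."""
    return any(value < slo for value in sli_during_incident)


# Hypothetical SLI readings sampled during three past incidents.
incidents = {
    "200 with error payload": {"old": [1.0, 0.9999], "new": [0.93, 0.95]},
    "slow search": {"old": [1.0, 1.0], "new": [0.88, 0.91]},
    "SSO outage": {"old": [0.9998, 1.0], "new": [0.60, 0.72]},
}

for name, readings in incidents.items():
    old = catches(readings["old"], slo=0.999)
    new = catches(readings["new"], slo=0.99)
    print(f"{name}: old SLI caught={old}, new SLI caught={new}")

# Any incident listed here means the candidate SLI needs another design pass.
missed = [name for name, r in incidents.items() if not catches(r["new"], slo=0.99)]
print(missed)
```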

Bring product and support in at design time. The first SLO was an SRE-only artifact. "Technically correct" and "organizationally meaningful" are not the same thing. The second one we designed jointly with product and support from the start.

Fix how SLO is used before you fix the SLI. If SLO is being treated as a grade, any SLI you invent will end up defending the grade. The organizational change has to come first, or come alongside the technical one.

Key Takeaways

  • SLO is a decision trigger, not a report card. Once "the SLO is green" stops closing conversations, an imperfect SLI no longer metastasizes into an organizational problem.
  • SLI should measure user experience, not server status codes. A 200 is not a success if the user saw an error page.
  • Synthetic + RUM together. Neither one alone is enough.
  • Fold latency into the SLI. "Slow" is "down" more often than not.
  • If your SLI never fires, the SLI is wrong. Validate it against recent real incidents.
  • Trust breaks faster than numbers. An SLI design mistake becomes an organizational problem long before it becomes a technical one.
  • It's okay to lower the SLO. Starting from what reality says and ratcheting up is healthier than parading an impossible target.

FAQ

Q: Isn't treating SLO as a KPI to hit just standard practice? Why is that a problem?

A: When SLO is treated as a KPI to hit, the incentive when numbers diverge from reality is to protect the numbers, not to confront reality. That's how you end up with teams defending an SLI that no longer reflects user experience. SLO is meant to be a decision trigger: miss it, and you freeze features to invest in reliability; beat it, and you can take more risk. Hitting the number isn't the goal. Without that shift in how SLO is used, the same problem will recur under any SLI.

Q: Why is starting from server-side metrics a bad idea for an SLI?

A: It's not wrong, just often insufficient. Server-side metrics only see the status your own server returned, which tends to drift from user experience. Many of the failures that hurt us showed up as "200 with an error payload," "200 but slow," a 4xx from a broken login, or "dependency down," not as 5xx. Designing the SLI from the user's side upfront is cheaper than reconciling it later.

Q: Can I get away with just synthetic monitoring, or just RUM?

A: Neither alone is enough. Synthetic monitoring runs fixed scenarios on a fixed schedule, so it gives consistent coverage around the clock, but it can't reflect the variety of real user behavior. RUM captures real users, but it has little signal in low-traffic hours. The combination is what gives you a useful picture of availability.

Q: Should latency be part of the availability SLI, or should I split it into a separate latency SLI?

A: Both approaches are valid. We folded latency into availability because we wanted a single "did the user succeed or not" signal. If you want to tell "the feature works but is getting slower" apart from "the feature is failing," a dedicated latency SLI makes sense. Decide based on how you want to slice the user experience, not on which approach is theoretically cleaner.

Q: If I change the SLI, I lose comparability with historical numbers. Is that a problem?

A: Comparability is lost, yes — but that's because the old SLI was measuring the wrong thing. During the transition, compute both in parallel and present them side by side: "old SLI says 99.95%, new SLI says 98.6%." That builds organizational understanding faster than hiding the divergence. Preserving a misleading metric for the sake of continuity does more damage long term.


Based on experience across multiple SRE engagements. Details that could identify specific companies or individuals have been omitted or generalized. Specific figures have been adjusted for clarity while preserving the essence of what happened.