On many engineering teams, one of the biggest challenges we face is ensuring that we meet the expectations of both our product team and our customers. This challenge becomes even more complex when the perceptions of key concepts like availability, reliability, and uptime differ between engineering and those outside of our direct sphere.
Recently, I’ve observed that a disconnect can exist between how we in engineering measure and define these terms and how our product team—and more importantly, our customers—view them. This misalignment can become particularly evident when we evaluate and review our ability to respond to incidents and outages.
The Perception Gap
From an engineering perspective, we often pride ourselves on maintaining what we believe to be solid uptime and reliability metrics. We have systems in place to monitor availability, detect incidents, and track performance. However, when you dig deeper, it can become evident that what we’ve measured as success was not always perceived as such by our product team or our customers.
For example, engineering might define availability as the time our core systems are operational. But from a customer’s perspective, if they’re unable to access a key feature—even if our infrastructure is technically available—they see it as downtime. If our software depends on an external party or service that becomes unreliable or unavailable, customers will still perceive that as downtime, even though it isn't in engineering's direct control. Similarly, we might describe an event as an "incident," while product or customers consider it an "outage." The language and definitions may not always be aligned, and the consequence is that expectations for how incidents should be handled may not be consistently met.
Defining Incidents vs. Outages
Part of the challenge is that we may not have a clear, shared understanding of what constitutes an “incident” versus an “outage.” In engineering, we often categorize based on severity, duration, or scope. But these definitions can be vague or inconsistent.
An "incident" might involve degraded performance or a minor issue, while an "outage" indicates a complete system failure. However, without a standard definition or clear thresholds that are shared across teams, confusion ensues. And when customers experience any disruption—whether minor or major—they’re not concerned with our internal distinctions. To them, it’s all downtime.
Building the Right Metrics
The first step in solving this disconnect is adjusting how we measure reliability and uptime. We need metrics that reflect customer and product expectations, not just our own.
This might involve:
- Measuring feature-level availability rather than just system-wide availability. Even if our backend is stable, if key features are down, that’s what impacts users.
- Tracking degraded performance as downtime rather than just complete outages. A slow system is still a poor experience for the user, even if it's technically operational (see the sketch after this list).
- Incorporating customer impact into incident severity classification. How many users were affected? What was the business impact? These questions should factor into how we measure incidents.
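As a rough illustration of the first two ideas above, feature-level checks can be rolled up into per-feature availability, with slow responses counted as downtime. The feature names, latency threshold, and sample data here are hypothetical, assumed only for the sake of the sketch:

```python
from dataclasses import dataclass


@dataclass
class Check:
    feature: str       # e.g. "checkout", "search"
    ok: bool           # did the check succeed functionally?
    latency_ms: float  # observed response time


# Hypothetical SLO: anything slower than this is counted as downtime,
# even if the request technically succeeded.
DEGRADED_LATENCY_MS = 2000


def feature_availability(checks: list[Check]) -> dict[str, float]:
    """Availability per feature, counting degraded (slow) checks as failures."""
    totals: dict[str, int] = {}
    healthy: dict[str, int] = {}
    for c in checks:
        totals[c.feature] = totals.get(c.feature, 0) + 1
        if c.ok and c.latency_ms <= DEGRADED_LATENCY_MS:
            healthy[c.feature] = healthy.get(c.feature, 0) + 1
    return {f: healthy.get(f, 0) / n for f, n in totals.items()}


checks = [
    Check("checkout", ok=True, latency_ms=300),
    Check("checkout", ok=True, latency_ms=4500),  # slow: counts as downtime
    Check("search", ok=False, latency_ms=100),
    Check("search", ok=True, latency_ms=250),
]
print(feature_availability(checks))  # {'checkout': 0.5, 'search': 0.5}
```

A backend dashboard might report 100% uptime over the same window; the per-feature view is closer to what the customer actually experienced.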
Improving Incident Response
The next critical area is how we respond to incidents. We’ve realized that the way we identify, triage, escalate, and communicate during incidents has not always met product and customer expectations. In many cases, the engineering response felt adequate internally, but externally there were frustrations about the speed of communication, the clarity of action steps, and the transparency of resolution.
Here are some things we’re focusing on to improve:
- Clear Incident Definitions and Processes: We’re working to solidify definitions of incidents vs. outages with clear thresholds and expectations for response times and escalation. Everyone from engineering to product to customer support needs to be on the same page.
- Real-time Communication: Timely and transparent communication with stakeholders is essential. Even if we don’t have all the answers yet, we need to provide updates so that product and customers know we’re actively addressing the issue.
- Defined Escalation Paths: Having clear processes for escalation ensures that the right teams are involved quickly. If an incident is not being resolved at a lower level, it needs to be escalated immediately so that the right people can address it without delay (a rough sketch of what these thresholds and paths might look like follows this list).
- Post-Incident Reviews: Conducting thorough post-incident reviews helps us identify not only what went wrong but also how the response could have been improved, which ensures continuous learning and adaptation. We don't want these reviews to become a ceremony, but we've learned that we should hold them more often than we have been so the lessons actually stick.
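As a loose sketch of what shared thresholds and escalation paths could look like in practice, the expectations can live in one table that engineering, product, and support all read. The severities, timings, and roles below are hypothetical placeholders, not a prescription:

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ResponsePolicy:
    ack_within: timedelta      # how quickly someone must acknowledge
    update_every: timedelta    # how often stakeholders get an update
    escalate_after: timedelta  # when to pull in the next level if unresolved
    escalate_to: str           # hypothetical role, not a real team name


# Hypothetical shared policy: one table that every team reads,
# instead of each team assuming its own numbers.
POLICY = {
    "sev1-outage": ResponsePolicy(
        ack_within=timedelta(minutes=5),
        update_every=timedelta(minutes=30),
        escalate_after=timedelta(minutes=15),
        escalate_to="incident-commander",
    ),
    "sev2-incident": ResponsePolicy(
        ack_within=timedelta(minutes=15),
        update_every=timedelta(hours=1),
        escalate_after=timedelta(hours=1),
        escalate_to="on-call-lead",
    ),
}


def should_escalate(severity: str, elapsed: timedelta) -> bool:
    """True if an unresolved incident of this severity has exceeded its escalation window."""
    return elapsed >= POLICY[severity].escalate_after


print(should_escalate("sev1-outage", timedelta(minutes=20)))    # True
print(should_escalate("sev2-incident", timedelta(minutes=20)))  # False
```

The value is less in the code itself than in the fact that response-time and escalation expectations stop living in individual heads.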
Meeting Expectations
Ultimately, to meet the expectations of product and customers, engineering teams need to adapt their mindset. It’s not enough to build systems that are robust from a technical perspective. We need to align our understanding of availability and reliability with those who rely on our systems every day.
Here are a few things that can help bridge this gap:
- Collaborate with Product Early: Working closely with the product team ensures that our definitions, goals, and metrics for reliability are aligned from the start. This also allows us to address any potential disconnects early in the process.
- Customer-Centric Metrics: Design metrics and processes with the end-user in mind. Availability and reliability should be measured based on how customers experience the product, not just how we monitor backend systems.
- Focus on Responsiveness: Beyond technical solutions, focus on how we respond to issues. Speed, transparency, and clear communication are key to ensuring that incidents don’t erode trust between engineering, product, and customers.
Conclusion
The challenge of aligning engineering, product, and customer expectations around availability and incident response has been an eye-opening experience. It has taught us that technical excellence must be paired with customer-centric thinking. By adjusting our metrics and improving how we respond to incidents, we can ensure that we’re not just building reliable systems but that we’re also meeting the needs of the people who rely on them every day.