Beyond dashboards: How AI is redefining the SRE Playbook

In my last post, we examined why the future is observable. Now we explore why the future must be intelligent.
In my previous article, I discussed how the observability space is ripe for disruption with newcomers like Tsuga or Dash0 leading the charge.
Whilst digging into this category, we saw another concept emerge as key in this market: remediation. Whenever we’ve spoken to a technical leader or industry expert, the same comment has come back over and over: observability serves one major purpose, which is to maintain your infrastructure and ensure that your systems don’t go down. If they do, you need to be able to remediate quickly. The standard measure of that speed is mean time to repair (or recover), abbreviated MTTR.
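To make the metric concrete, here is a minimal Python sketch of how MTTR could be computed from incident records. The `Incident` class, the `mean_time_to_recover` function, and the sample timestamps are hypothetical and only serve as illustration; real teams differ on whether the clock starts at failure, detection, or acknowledgement.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record: detection and resolution times."""
    detected_at: datetime
    resolved_at: datetime


def mean_time_to_recover(incidents: list[Incident]) -> timedelta:
    """Average time from detection to resolution across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.resolved_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


# Made-up sample data, for illustration only.
incidents = [
    Incident(datetime(2025, 1, 3, 9, 0), datetime(2025, 1, 3, 9, 45)),      # 45 min
    Incident(datetime(2025, 2, 11, 14, 30), datetime(2025, 2, 11, 16, 0)),  # 90 min
    Incident(datetime(2025, 3, 20, 2, 15), datetime(2025, 3, 20, 2, 30)),   # 15 min
]
mttr = mean_time_to_recover(incidents)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")
```

With these three sample incidents, the average works out to 50 minutes. Lowering that number is the goal the companies discussed here are chasing.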

Over the last 12 months, several companies have emerged in this category, raising large funding rounds to reduce MTTR by building on top of existing observability solutions. Those include Resolve AI, Traversal AI, TierZero…
The companies behind these solutions are generally called AI SRE companies.
What is the role of a Site Reliability Engineer (SRE)?
Today, a Site Reliability Engineer (SRE) bridges the gap between development and operations, treating infrastructure as code to ensure reliability and scalability. Their role has fundamentally shifted from reactive “gatekeepers” to proactive engineers, who design systems to prevent failure.
Before observability, SREs relied on static monitoring — simply checking if a server was “up” — often leaving them blind to the cause of failures in distributed microservices.
With the advent of observability, the role has evolved into “investigative engineering.” SREs now ask why a system is behaving anomalously rather than only noticing when it breaks. This visibility allows them to define precise Service Level Objectives (SLOs) based on real user experience rather than arbitrary statistics. Instead of fighting fires, they use deep insights to optimize performance and prevent failures before they impact the business.
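As a rough illustration of what an SLO looks like in practice, the sketch below computes how much of an “error budget” remains for a given availability target. The function name, the 99.9% target, and the request counts are assumptions chosen for the example, not figures from any specific team or tool.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    slo_target is the share of requests that must succeed, e.g. 0.999.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# Hypothetical month: 1,000,000 requests, 400 failures, 99.9% SLO.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(f"Error budget remaining: {remaining:.0%}")
```

In this made-up case the SLO allows roughly 1,000 failed requests, so 400 failures leave about 60% of the budget. A team in that position can keep shipping; one that has burned through its budget would prioritize reliability work instead.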
However, as data volumes explode and teams begin drowning in dashboards where metrics multiply like rabbits, the role is shifting again.

We are entering the “Observability 2.0” era, where AI agents act as copilots that give SREs better analytics and faster troubleshooting. By moving from raw telemetry to contextual insights, AI is enabling a transition from simple detection to automated remediation. This redefines the economics of how systems are understood and allows SREs to finally close the loop on reliability.
AI is coming to automate parts of this workflow
Given how critical incident response is, large enterprises are still hesitant to automate it fully: they don’t yet trust the underlying technology and require products to keep a human in the loop. Therefore, companies like Resolve AI or Phoebe act as support systems that help DevOps and SRE teams make decisions faster, so they can act proactively and prevent outages when anomalies appear.
Today, we believe that AI is good enough to be an assistant, but not a solo pilot…yet. AI is really good at pattern matching but still struggles with “creative debugging,” i.e., cases where the problem is a black swan and does not match a known pattern. In those situations, it is likely to hallucinate, or to fixate on a single signal and ignore more subtle ones. According to current benchmarks, AI agents achieve about 45–60% accuracy in identifying root causes autonomously, compared to 60–75% for senior SREs. However, when the two work together, accuracy jumps to over 90% while the time spent is cut in half. Source: https://github.com/phamquiluan/RCAEval
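The human-in-the-loop pattern described above can be sketched in a few lines of Python. This is a simplified illustration, not how any of the vendors named here implement it: the `Proposal` class, the `handle_proposal` function, and the callbacks are all hypothetical. The point is only that the agent proposes and a person decides.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Proposal:
    """Hypothetical remediation suggested by an AI agent."""
    action: str
    rationale: str  # explanation shown to the human reviewer


def handle_proposal(
    proposal: Proposal,
    approve: Callable[[Proposal], bool],
    execute: Callable[[str], None],
) -> str:
    """Run the proposed action only if a human approves it."""
    if approve(proposal):
        execute(proposal.action)
        return "executed"
    return "rejected"


# Stand-in callbacks: a real system would page an on-call engineer
# and call into deployment tooling instead.
executed: list[str] = []
proposal = Proposal(
    action="restart checkout-service",
    rationale="error rate spiked right after the latest deploy",
)
status = handle_proposal(proposal, approve=lambda p: True, execute=executed.append)
print(status, executed)
```

Keeping the approval step as an explicit function makes it easy to loosen later, for instance by auto-approving low-risk actions once a team trusts the agent’s track record.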

DevOps is moving from being a “gatekeeper” to an “enabler.”
We believe that engineering teams are entering a new era in which the traditional DevOps bottleneck dissolves. This shift gives developers far more control over their own workflows, letting them handle QA and debugging themselves rather than waiting in a queue for external support.
By removing the friction of constant hand-offs, we aren’t just shifting tasks; we are creating a more fluid, high-velocity environment. This allows SRE and DevOps teams to move away from reactive troubleshooting and focus on what truly matters: architecting resilient core systems and ensuring the highest possible quality of service. In this better way of working, the “bottleneck” disappears, replaced by a partnership where everyone is empowered to focus on their highest-value work.

While we believe that enterprise adoption of AI-powered SRE tools is not yet widespread, we are convinced that these platforms will shape the future of DevOps, enabling engineering teams to build faster and higher-quality software.