Your 24/7 On-Call AI Agent - Solve Alerts Faster with Automatic Correlations, Investigations, and More
-
Updated
Apr 25, 2025 - Python
Site reliability engineering (SRE) is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.
Your 24/7 On-Call AI Agent - Solve Alerts Faster with Automatic Correlations, Investigations, and More
Making on-call suck less for engineers
OpenShift Guide. Learn about the Red Hat OpenShift Container Platform, Data Science, Code Ready Containers, Podman, Buildah, and Kubernetes.
Welcome To The World of DevOps. An ongoing & curated collection of awesome software, libraries, learning tutorials, tools and resources and cool stuff about DevOps.
[ASE'24][WWW'25] RCAEval: A Benchmark for Root Cause Analysis. https://doi.org/10.1145/3691620.3695065
The agent of Komlog, a PaaS for helping observability teams to better understand their systems.
Deterministic Subsetting as defined in the SRE book
Use Grafana k6, Dynatrace business events, workflows and site reliability guardian to validate software releases
Oncall Dashboard
An agent to monitor the uptime of multiple sites and response time without any external libraries.