Deploy, Calibrate, Monitor, Heal -- No Human Required: An Autonomous AI SRE Agent for Elasticsearch
Muhamed Ramees Cheriya Mukkolakkal

TL;DR
This paper introduces the ES Guardian Agent, an autonomous AI system that manages Elasticsearch clusters end-to-end, using multi-source telemetry and AI-driven failure prediction to achieve near-perfect availability without human intervention.
Contribution
It presents a novel multi-phase autonomous AI SRE agent capable of full lifecycle management of Elasticsearch, including failure prediction and proactive healing, with real-world deployment results.
Findings
Successfully recovered from an 18-hour outage autonomously.
Diagnosed hardware NIC failures across all nodes.
Achieved 99.9999% availability in production environment.
Abstract
Operating Elasticsearch clusters at scale demands continuous human expertise spanning the full lifecycle -- from initial deployment through performance tuning, monitoring, failure prediction, and incident recovery. We present the ES Guardian Agent, an autonomous AI SRE system that manages the complete Elasticsearch lifecycle without human intervention through eleven distinct phases: Evaluate, Optimize, Deploy, Calibrate, Stabilize, Alert, Predict, Heal, Learn, and Upgrade. A critical differentiator is its multi-source predictive failure engine, which continuously ingests and correlates metrics trends, application logs, and kernel-level telemetry -- including Linux dmesg streams, NVMe SMART data, NIC bond statistics, and thermal sensors -- to anticipate failures hours before they materialize. By cross-referencing current system signatures against a persistent incident memory of resolved…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
