Agentic Observability: Automated Alert Triage for Adobe E-Commerce
Aprameya Bharadwaj, Kyle Tu

TL;DR
This paper introduces an agentic observability framework that autonomously performs alert triage in Adobe's e-commerce system, significantly reducing incident resolution time through AI-driven analysis and actions.
Contribution
The paper presents an autonomous alert triage system built on the ReAct (reasoning and acting) paradigm, improving the efficiency of enterprise incident response while keeping diagnostic accuracy comparable to manual triage.
Findings
90% reduction in mean time to insight
Order-of-magnitude decrease in triage latency
Maintains diagnostic accuracy comparable to manual methods
Abstract
Modern enterprise systems exhibit complex interdependencies that make observability and incident response increasingly challenging. Manual alert triage, which typically involves log inspection, API verification, and cross-referencing operational knowledge bases, remains a major bottleneck in reducing mean time to recovery (MTTR). This paper presents an agentic observability framework deployed within Adobe's e-commerce infrastructure that autonomously performs alert triage using a ReAct paradigm. Upon alert detection, the agent dynamically identifies the affected service, retrieves and analyzes correlated logs across distributed systems, and plans context-dependent actions such as handbook consultation, runbook execution, or retrieval-augmented analysis of recently deployed code. Empirical results from production deployment indicate a 90% reduction in mean time to insight compared to manual…
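The triage flow described in the abstract (identify the service, pull correlated logs, then pick a context-dependent action) can be illustrated with a minimal ReAct-style loop. This is a sketch under stated assumptions, not the authors' implementation: the tool functions, the service names, the canned log text, and the rule-based `choose_action` step (standing in for an LLM reasoning step) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Alert:
    service: str   # affected service (hypothetical name in the usage below)
    message: str


@dataclass
class TriageState:
    alert: Alert
    # Each entry is (action_name, observation_text), in the order taken.
    observations: List[Tuple[str, str]] = field(default_factory=list)


# Hypothetical tools. Real ones would query a log store, a handbook index,
# and a deployment history; these stubs return canned strings.
def fetch_logs(alert: Alert) -> str:
    return f"[{alert.service}] ERROR timeout calling payment-gateway"


def consult_handbook(alert: Alert) -> str:
    return "Handbook: for timeouts, check recent deployments first"


def inspect_recent_deploys(alert: Alert) -> str:
    return "Recent deploy changed the HTTP client timeout setting"


TOOLS: Dict[str, Callable[[Alert], str]] = {
    "fetch_logs": fetch_logs,
    "consult_handbook": consult_handbook,
    "inspect_recent_deploys": inspect_recent_deploys,
}


def choose_action(state: TriageState) -> str:
    """Reasoning step: pick the next action from what has been observed.

    Assumption: simple rules stand in for the LLM call a real agent would make.
    """
    taken = [name for name, _ in state.observations]
    if "fetch_logs" not in taken:
        return "fetch_logs"
    logs = dict(state.observations)["fetch_logs"]
    if "timeout" in logs and "consult_handbook" not in taken:
        return "consult_handbook"
    if "inspect_recent_deploys" not in taken:
        return "inspect_recent_deploys"
    return "finish"


def triage(alert: Alert, max_steps: int = 5) -> TriageState:
    """Alternate reasoning (choose_action) and acting (tool call) until done."""
    state = TriageState(alert)
    for _ in range(max_steps):
        action = choose_action(state)
        if action == "finish":
            break
        observation = TOOLS[action](alert)
        state.observations.append((action, observation))
    return state


if __name__ == "__main__":
    result = triage(Alert(service="checkout", message="5xx rate above threshold"))
    for action, observation in result.observations:
        print(f"{action}: {observation}")
```

The `max_steps` bound is a design choice for the sketch: it keeps the loop from running indefinitely if the reasoning step never returns `finish`.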
Taxonomy
Topics: Software System Performance and Reliability · Mobile Agent-Based Network Management · Advanced Software Engineering Methodologies
