Validating Alerts in Cloud-Native Observability

Maria C. Borges; Julian Legler; Lucca Di Benedetto

arXiv:2510.23970·cs.SE·October 29, 2025

TL;DR

This paper presents a new extension for the OXN observability tool that enables early-stage design, tuning, and validation of alerts to improve reliability and reduce false alarms in cloud-native systems.

Contribution

It introduces a systematic approach and tool extension for testing and validating alert rules during development, addressing a gap in current observability practices.

Findings

01. Enables early alert tuning during development.

02. Reduces false alarms and improves reliability.

03. Supports systematic validation of alert behavior.

Abstract

Observability and alerting form the backbone of modern reliability engineering. Alerts help teams catch faults early before they turn into production outages and serve as first clues for troubleshooting. However, designing effective alerts is challenging. They need to strike a fine balance between catching issues early and minimizing false alarms. On top of this, alerts often cover uncommon faults, so the code is rarely executed and therefore rarely checked. To address these challenges, several industry practitioners advocate for testing alerting code with the same rigor as application code. Still, there's a lack of tools that support such systematic design and validation of alerts. This paper introduces a new alerting extension for the observability experimentation tool OXN. It lets engineers experiment with alerts early during development. With OXN, engineers can now tune rules at…
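
The trade-off described above, between catching issues early and minimizing false alarms, can be illustrated with a small test written in the style of application tests. This is a minimal sketch and not OXN's interface: the `AlertRule` class, the `validate` helper, and the latency values are all hypothetical, chosen only to show how a rule might be checked against a fault-free run and a run with an injected fault.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AlertRule:
    """Hypothetical threshold rule (not an OXN or Prometheus type).

    Fires once the metric has exceeded `threshold` for `for_samples`
    consecutive samples.
    """
    threshold: float
    for_samples: int

    def evaluate(self, series: List[float]) -> List[bool]:
        firing = []
        streak = 0
        for value in series:
            # Count consecutive samples above the threshold.
            streak = streak + 1 if value > self.threshold else 0
            firing.append(streak >= self.for_samples)
        return firing


def detection_delay(firing: List[bool], fault_start: int) -> Optional[int]:
    """Samples between fault injection and the first firing, or None if missed."""
    for i in range(fault_start, len(firing)):
        if firing[i]:
            return i - fault_start
    return None


def validate(rule: AlertRule, baseline: List[float], faulty: List[float],
             fault_start: int) -> Tuple[int, Optional[int]]:
    """Return (false alarms on the fault-free run, detection delay on the faulty run)."""
    false_alarms = sum(rule.evaluate(baseline))
    delay = detection_delay(rule.evaluate(faulty), fault_start)
    return false_alarms, delay


# Made-up latency samples in milliseconds (illustrative only).
baseline = [100, 120, 110, 260, 105, 115, 100, 110]  # one harmless spike
faulty = [100, 110, 105, 300, 320, 310, 305, 300]    # fault injected at index 3

eager = AlertRule(threshold=250, for_samples=1)
patient = AlertRule(threshold=250, for_samples=3)

print(validate(eager, baseline, faulty, fault_start=3))    # (1, 0)
print(validate(patient, baseline, faulty, fault_start=3))  # (0, 2)
```

In this toy data the eager rule detects the fault immediately but also fires on the harmless spike, while the patient rule avoids the false alarm at the cost of a two-sample detection delay. Asserting on both numbers turns the tuning decision into a repeatable test.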
