Automated Self-Testing as a Quality Gate: Evidence-Driven Release Management for LLM Applications
Alexandre Cristovão Maiorano

TL;DR
This paper introduces an automated self-testing framework with evidence-based quality gates for managing the release of LLM applications, ensuring stability and safety through empirical evaluation.
Contribution
It presents a novel, evidence-driven release management system for LLM applications that incorporates multiple evaluation dimensions and supports independent validation.
Findings
The gate identified two ROLLBACK-grade builds in early runs and supported stable quality evolution over a four-week staging lifecycle.
Evidence coverage is the key discriminator for severe regressions.
Runtime scales predictably with suite size, which makes it practical to grow the test suite.
Abstract
LLM applications are AI systems whose nondeterministic outputs and evolving model behavior make traditional testing insufficient for release governance. We present an automated self-testing framework that introduces quality gates with evidence-based release decisions (PROMOTE/HOLD/ROLLBACK) across five empirically grounded dimensions: task success rate, research context preservation, P95 latency, safety pass rate, and evidence coverage. We evaluate the framework through a longitudinal case study of an internally deployed multi-agent conversational AI system with specific marketing capabilities in active development, covering 38 evaluation runs across 20+ internal releases. The gate identified two ROLLBACK-grade builds in early runs and supported stable quality evolution over a four-week staging lifecycle while exercising persona-grounded, multi-turn, adversarial, and evidence-required…
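The gate described in the abstract can be sketched as a small decision function over the five dimensions. This is a minimal illustration, not the paper's implementation: the threshold values, the field names, the rule that any severe breach forces ROLLBACK, and the nearest-rank P95 helper are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROMOTE = "PROMOTE"
    HOLD = "HOLD"
    ROLLBACK = "ROLLBACK"


@dataclass(frozen=True)
class RunMetrics:
    """One evaluation run, summarized along the five gate dimensions."""
    task_success_rate: float      # fraction in [0, 1]
    context_preservation: float   # fraction in [0, 1]
    p95_latency_s: float          # seconds, lower is better
    safety_pass_rate: float       # fraction in [0, 1]
    evidence_coverage: float      # fraction in [0, 1]


# Hypothetical thresholds (not from the paper): (promote_floor, rollback_floor).
RATE_THRESHOLDS = {
    "task_success_rate": (0.90, 0.70),
    "context_preservation": (0.85, 0.60),
    "safety_pass_rate": (0.99, 0.95),
    "evidence_coverage": (0.80, 0.50),
}
# Hypothetical latency bounds: (promote_ceiling, rollback_ceiling) in seconds.
LATENCY_BOUNDS_S = (5.0, 15.0)


def p95(latencies_s: list[float]) -> float:
    """Nearest-rank 95th percentile; integer math avoids float rounding."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def decide(m: RunMetrics) -> Decision:
    """ROLLBACK on any severe breach, PROMOTE if every dimension is healthy, else HOLD."""
    severe = m.p95_latency_s > LATENCY_BOUNDS_S[1] or any(
        getattr(m, name) < rollback_floor
        for name, (_, rollback_floor) in RATE_THRESHOLDS.items()
    )
    if severe:
        return Decision.ROLLBACK
    healthy = m.p95_latency_s <= LATENCY_BOUNDS_S[0] and all(
        getattr(m, name) >= promote_floor
        for name, (promote_floor, _) in RATE_THRESHOLDS.items()
    )
    return Decision.PROMOTE if healthy else Decision.HOLD
```

With these assumed thresholds, a run that is healthy everywhere except for very low evidence coverage is classified as ROLLBACK, while a run that only slightly misses the task-success floor is held for review rather than rolled back.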
Taxonomy
Topics: Ethics and Social Impacts of AI · Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education
