LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications
Alexandre Cristovão Maiorano

TL;DR
This paper introduces a comprehensive readiness harness for LLM and RAG applications that integrates evaluation, observability, and CI gates to support deployment decisions based on multiple operational metrics.
Contribution
It presents a novel framework combining automated benchmarks, observability, and quality gates into a unified workflow for assessing LLM/RAG readiness.
Findings
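Readiness is not a single metric: scenario-weighted scores separate models on both quality and operational cost. On FiQA under sla-first at k=5, gpt-4.1-mini leads in readiness and faithfulness, while gpt-5.2 pays a substantial latency cost.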
The harness can identify unsafe prompt variants and prevent risky releases.
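Evaluation on ticket-routing workflows and BEIR grounding tasks (SciFact and FiQA) achieves full Azure matrix coverage: 162/162 valid cells across datasets, scenarios, retrieval depths, seeds, and models.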
Abstract
We present a readiness harness for LLM and RAG applications that turns evaluation into a deployment decision workflow. The system combines automated benchmarks, OpenTelemetry observability, and CI quality gates under a minimal API contract, then aggregates workflow success, policy compliance, groundedness, retrieval hit rate, cost, and p95 latency into scenario-weighted readiness scores with Pareto frontiers. We evaluate the harness on ticket-routing workflows and BEIR grounding tasks (SciFact and FiQA) with full Azure matrix coverage (162/162 valid cells across datasets, scenarios, retrieval depths, seeds, and models). Results show that readiness is not a single metric: on FiQA under sla-first at k=5, gpt-4.1-mini leads in readiness and faithfulness, while gpt-5.2 pays a substantial latency cost; on SciFact, models are closer in quality but still separable operationally. Ticket-routing…
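The aggregation described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the weights, the budgets used to normalize cost and latency, the gate threshold, and every name below (CellMetrics, SCENARIO_WEIGHTS, readiness, pareto_frontier, passes_gate) are assumptions chosen for illustration. The sketch combines the six metrics named in the abstract into one scenario-weighted score, then keeps the cells that are not dominated on groundedness versus p95 latency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellMetrics:
    """Metrics for one evaluated cell (hypothetical record layout)."""
    model: str
    workflow_success: float    # fraction in [0, 1], higher is better
    policy_compliance: float   # fraction in [0, 1], higher is better
    groundedness: float        # fraction in [0, 1], higher is better
    retrieval_hit_rate: float  # fraction in [0, 1], higher is better
    cost_usd: float            # cost per request, lower is better
    p95_latency_s: float       # seconds, lower is better


# Illustrative weights only; the paper's actual scenario weights are not given here.
SCENARIO_WEIGHTS = {
    "sla-first": {
        "workflow_success": 0.20,
        "policy_compliance": 0.20,
        "groundedness": 0.15,
        "retrieval_hit_rate": 0.05,
        "cost": 0.10,
        "latency": 0.30,
    },
}


def readiness(m, weights, cost_budget_usd=0.10, latency_budget_s=5.0):
    """Weighted mean of per-metric scores, each mapped into [0, 1].

    Cost and latency are turned into "higher is better" scores by
    comparing them with an assumed budget and clamping at zero.
    """
    parts = {
        "workflow_success": m.workflow_success,
        "policy_compliance": m.policy_compliance,
        "groundedness": m.groundedness,
        "retrieval_hit_rate": m.retrieval_hit_rate,
        "cost": max(0.0, 1.0 - m.cost_usd / cost_budget_usd),
        "latency": max(0.0, 1.0 - m.p95_latency_s / latency_budget_s),
    }
    total_weight = sum(weights.values())
    return sum(weights[k] * parts[k] for k in weights) / total_weight


def pareto_frontier(cells):
    """Cells not dominated on (groundedness up, p95 latency down)."""
    def dominates(a, b):
        no_worse = (a.groundedness >= b.groundedness
                    and a.p95_latency_s <= b.p95_latency_s)
        strictly_better = (a.groundedness > b.groundedness
                           or a.p95_latency_s < b.p95_latency_s)
        return no_worse and strictly_better

    return [c for c in cells if not any(dominates(o, c) for o in cells)]


def passes_gate(score, threshold=0.8):
    """CI-style gate: True when the score meets an assumed threshold."""
    return score >= threshold
```

In a CI job, a failing passes_gate result would typically be turned into a non-zero exit code so that the release is blocked. Reporting the frontier alongside the score keeps the quality and latency trade-off visible instead of hiding it inside a single number.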
