Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems
Mukul Chhabra, Luigi Medrano, Arush Verma

TL;DR
This paper introduces a case-aware evaluation framework for enterprise multi-turn RAG systems. It targets operational constraints and case-specific failure modes so that these systems can be assessed and improved more accurately.
Contribution
It presents a novel multi-metric evaluation framework tailored for enterprise RAG systems, addressing limitations of existing benchmarks and enabling scalable, diagnostic assessment.
Findings
Generic proxy metrics are ambiguous for enterprise RAG evaluation.
The framework exposes critical tradeoffs in system performance.
Deterministic JSON prompting enables scalable batch evaluation.
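As an illustration of the last finding, the following is a minimal sketch of what fixed-schema JSON prompting for a judge model could look like. It is not the authors' implementation. The key names are assumptions derived from the five groupings named in the abstract (the paper's eight metric names are not listed on this page), and the 0 to 2 scale, the prompt wording, and the helper names `build_judge_prompt` and `parse_judge_output` are hypothetical. The call to the judge model itself is left out.

```python
import json

# Assumed keys: the five groupings named in the abstract, not the paper's
# actual eight metrics, which this page does not list.
DIMENSIONS = (
    "retrieval_quality",
    "grounding_fidelity",
    "answer_utility",
    "precision_integrity",
    "case_workflow_alignment",
)
MIN_SCORE, MAX_SCORE = 0, 2  # hypothetical scale, chosen for illustration


def build_judge_prompt(case_summary: str, user_turn: str,
                       retrieved: list[str], answer: str) -> str:
    """Build a prompt whose text depends only on its inputs."""
    template = {name: MIN_SCORE for name in DIMENSIONS}
    passages = "\n".join(f"[{i}] {p}" for i, p in enumerate(retrieved, start=1))
    return (
        "You are grading one turn of an enterprise support conversation.\n"
        f"Case summary: {case_summary}\n"
        f"User turn: {user_turn}\n"
        f"Retrieved passages:\n{passages}\n"
        f"Assistant answer: {answer}\n"
        f"Score each key as an integer from {MIN_SCORE} to {MAX_SCORE}.\n"
        "Reply with one JSON object and nothing else, using exactly these keys:\n"
        + json.dumps(template, sort_keys=True)
    )


def parse_judge_output(raw: str) -> dict[str, int]:
    """Validate a judge reply; raise ValueError on any schema violation."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    if set(data) != set(DIMENSIONS):
        raise ValueError("judge output has missing or unexpected keys")
    for name in DIMENSIONS:
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer")
        if not MIN_SCORE <= value <= MAX_SCORE:
            raise ValueError(f"{name} is outside the allowed range")
    return {name: data[name] for name in DIMENSIONS}
```

Rejecting any reply that does not match the schema exactly lets a batch job retry or flag a turn instead of silently recording a partial score. Whether the judge's decoding is itself deterministic depends on the model settings, which this sketch does not cover.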
Abstract
Enterprise Retrieval-Augmented Generation (RAG) assistants operate in multi-turn, case-based workflows such as technical support and IT operations, where evaluation must reflect operational constraints, structured identifiers (e.g., error codes, versions), and resolution workflows. Existing RAG evaluation frameworks are primarily designed for benchmark-style or single-turn settings and often fail to capture enterprise-specific failure modes such as case misidentification, workflow misalignment, and partial resolution across turns. We present a case-aware LLM-as-a-Judge evaluation framework for enterprise multi-turn RAG systems. The framework evaluates each turn using eight operationally grounded metrics that separate retrieval quality, grounding fidelity, answer utility, precision integrity, and case/workflow alignment. A severity-aware scoring protocol reduces score inflation and…
Taxonomy
Topics: Software System Performance and Reliability · Software Engineering Research · Business Process Modeling and Analysis
