Human-Calibrated Automated Testing and Validation of Generative Language Models
Agus Sudjianto, Aijun Zhang, Srinivas Neppalli, Tarun Joshi, Michal Malohlava

TL;DR
This paper presents HCAT, a comprehensive, human-calibrated framework for evaluating generative language models, especially RAG systems, focusing on safety, robustness, and alignment with human judgments in high-stakes domains.
Contribution
It introduces a novel multi-layered evaluation framework combining automated testing, explainable metrics, and calibration methods to improve GLM assessment reliability and transparency.
Findings
Effective calibration aligns machine and human judgments.
Robustness testing identifies vulnerabilities to adversarial inputs.
Targeted weakness analysis pinpoints specific model shortcomings.
Abstract
This paper introduces a comprehensive framework for the evaluation and validation of generative language models (GLMs), with a focus on Retrieval-Augmented Generation (RAG) systems deployed in high-stakes domains such as banking. GLM evaluation is challenging due to open-ended outputs and subjective quality assessments. Leveraging the structured nature of RAG systems, where generated responses are grounded in a predefined document collection, we propose the Human-Calibrated Automated Testing (HCAT) framework. HCAT integrates a) automated test generation using stratified sampling, b) embedding-based metrics for explainable assessment of functionality, risk and safety attributes, and c) a two-stage calibration approach that aligns machine-generated evaluations with human judgments through probability calibration and conformal prediction. In addition, the framework includes robustness…
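The abstract names conformal prediction as one stage of the calibration approach but does not spell out the procedure. As a rough illustration only, the sketch below shows generic split conformal prediction applied to machine-versus-human scores. It is not the authors' implementation: the function names `conformal_quantile` and `conformal_interval`, the absolute-residual nonconformity score, and all numeric values are assumptions made up for this example.

```python
import math

def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of absolute residuals."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank in the sorted list
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(residuals)[rank - 1]

def conformal_interval(machine_score, q):
    """Symmetric interval around a machine score with half-width q."""
    return (machine_score - q, machine_score + q)

# Hypothetical calibration set: machine metric vs. human rating per response.
machine = [0.82, 0.40, 0.65, 0.91, 0.30, 0.55, 0.77, 0.20, 0.60]
human = [0.80, 0.50, 0.60, 0.95, 0.25, 0.70, 0.75, 0.30, 0.58]

# Nonconformity score (an assumption): absolute machine-human disagreement.
residuals = [abs(h - m) for h, m in zip(human, machine)]

# Half-width targeting roughly 80% coverage of the human rating.
q = conformal_quantile(residuals, alpha=0.2)

# Interval for a new, unlabeled response with machine score 0.70.
low, high = conformal_interval(0.70, q)
print(f"q = {q:.2f}, interval = [{low:.2f}, {high:.2f}]")
```

Under the usual exchangeability assumption, an interval built this way contains the human rating for a new response with probability at least 1 - alpha. When the calibration set is too small for the requested alpha, the sketch returns an infinite half-width instead of an overconfident one.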
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Linear Warmup With Linear Decay · Layer Normalization · Byte Pair Encoding · Adam · Residual Connection · Weight Decay · Softmax
