Human-Calibrated Automated Testing and Validation of Generative Language Models

Agus Sudjianto; Aijun Zhang; Srinivas Neppalli; Tarun Joshi; Michal Malohlava

arXiv:2411.16391 · cs.CL · December 10, 2024 · 2 cites

Open Access

TL;DR

This paper presents HCAT, a comprehensive, human-calibrated framework for evaluating generative language models, especially RAG systems, focusing on safety, robustness, and alignment with human judgments in high-stakes domains.

Contribution

The paper introduces a multi-layered evaluation framework that combines automated testing, explainable metrics, and calibration methods to make the assessment of generative language models (GLMs) more reliable and transparent.

Findings

01. Effective calibration aligns machine and human judgments.

02. Robustness testing identifies vulnerabilities to adversarial inputs.

03. Targeted weakness analysis pinpoints specific model shortcomings.

Abstract

This paper introduces a comprehensive framework for the evaluation and validation of generative language models (GLMs), with a focus on Retrieval-Augmented Generation (RAG) systems deployed in high-stakes domains such as banking. GLM evaluation is challenging due to open-ended outputs and subjective quality assessments. Leveraging the structured nature of RAG systems, where generated responses are grounded in a predefined document collection, we propose the Human-Calibrated Automated Testing (HCAT) framework. HCAT integrates a) automated test generation using stratified sampling, b) embedding-based metrics for explainable assessment of functionality, risk and safety attributes, and c) a two-stage calibration approach that aligns machine-generated evaluations with human judgments through probability calibration and conformal prediction. In addition, the framework includes robustness…
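The two-stage calibration idea in the abstract (probability calibration followed by conformal prediction) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the abstract does not say which calibrator or conformal variant HCAT uses, so this sketch swaps in histogram binning for stage one and split conformal prediction over a binary human label ("acceptable" or not) for stage two. The function names, the bin count, and the synthetic data are all hypothetical.

```python
import math
import numpy as np

def fit_binned_calibrator(scores, labels, n_bins=5):
    """Stage 1 (assumed technique): histogram binning. Maps a machine score
    in [0, 1] to the rate of positive human labels observed in its bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(scores, edges[1:-1])           # bin index in 0..n_bins-1
    rates = np.full(n_bins, float(np.mean(labels)))  # fallback for empty bins
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rates[b] = np.mean(labels[mask])

    def calibrate(s):
        return rates[np.digitize(s, edges[1:-1])]

    return calibrate

def conformal_threshold(probs, labels, alpha=0.1):
    """Stage 2 (assumed technique): split conformal prediction. Returns the
    nonconformity threshold computed on a held-out calibration set."""
    # Nonconformity = 1 - calibrated probability given to the human label.
    p_true = np.where(labels == 1, probs, 1.0 - probs)
    scores = np.sort(1.0 - p_true)
    n = len(scores)
    k = min(n, math.ceil((n + 1) * (1 - alpha)))     # finite-sample rank
    return float(scores[k - 1])

def prediction_set(prob, qhat):
    """Labels whose nonconformity does not exceed the threshold."""
    out = []
    if 1.0 - prob <= qhat:   # nonconformity of label 1
        out.append(1)
    if prob <= qhat:         # nonconformity of label 0
        out.append(0)
    return out

# Synthetic demo data (hypothetical): machine scores and noisy human labels.
rng = np.random.default_rng(0)
machine = rng.uniform(0, 1, 600)
human = (rng.uniform(0, 1, 600) < machine ** 2).astype(int)

calibrate = fit_binned_calibrator(machine[:300], human[:300])
qhat = conformal_threshold(calibrate(machine[300:]), human[300:], alpha=0.1)
print(prediction_set(float(calibrate(np.array([0.95]))[0]), qhat))
```

A prediction set containing both labels signals that the machine score is too uncertain to stand in for a human judgment at the chosen error level, which is one plausible way to decide which cases to route to human review.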

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Linear Warmup With Linear Decay · Layer Normalization · Byte Pair Encoding · Adam · Residual Connection · Weight Decay · Softmax