Evaluating Quality of Answers for Retrieval-Augmented Generation: A Strong LLM Is All You Need
Yang Wang, Alberto Garcia Hernandez, Roman Kyslyi, Nicholas Kersting

TL;DR
This paper introduces vRAG-Eval, a new grading system for assessing answer quality in Retrieval-Augmented Generation, and demonstrates that GPT-4 can evaluate answers with high agreement with human judgments, especially in factual, closed-domain contexts.
Contribution
The paper presents vRAG-Eval, a novel evaluation framework for RAG answer quality, and shows GPT-4's evaluations align closely with human experts, highlighting LLMs as effective evaluators.
Findings
vRAG-Eval effectively assesses correctness, completeness, and honesty.
GPT-4 reaches 83% agreement with human expert judgments on accept or reject decisions.
LLMs can serve as reliable evaluators where human evaluation would be resource-intensive.
Abstract
We present a comprehensive study of answer quality evaluation in Retrieval-Augmented Generation (RAG) applications using vRAG-Eval, a novel grading system designed to assess correctness, completeness, and honesty. We further map the grades for these quality aspects to a binary score indicating an accept or reject decision, mirroring the intuitive "thumbs-up" or "thumbs-down" gesture commonly used in chat applications. This approach suits factual business contexts where a clear decision is essential. Our assessment applies vRAG-Eval to two Large Language Models (LLMs), evaluating the quality of answers generated by a vanilla RAG application. We compare these evaluations with human expert judgments and find substantial alignment between GPT-4's assessments and those of human experts, reaching 83% agreement on accept or reject decisions. This study highlights…
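The grade-to-decision mapping and the agreement figure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 1-to-5 integer scale, the cutoff of 4, and the names `to_decision` and `agreement_rate` are assumptions made for illustration, since the text here does not state the scale or the threshold.

```python
from typing import Sequence

# Assumption: grades are integers on a 1-5 scale and a grade of 4 or higher
# counts as "accept". The scale and cutoff are hypothetical, not from the source.
ACCEPT_THRESHOLD = 4


def to_decision(grade: int, threshold: int = ACCEPT_THRESHOLD) -> bool:
    """Map a graded score to a binary decision: True = accept, False = reject."""
    return grade >= threshold


def agreement_rate(llm_grades: Sequence[int], human_grades: Sequence[int]) -> float:
    """Fraction of answers on which the LLM and human accept/reject decisions match."""
    if len(llm_grades) != len(human_grades):
        raise ValueError("grade lists must have the same length")
    if not llm_grades:
        raise ValueError("grade lists must not be empty")
    matches = sum(
        to_decision(llm) == to_decision(human)
        for llm, human in zip(llm_grades, human_grades)
    )
    return matches / len(llm_grades)


if __name__ == "__main__":
    # Made-up grades for six answers, purely illustrative (not data from the paper).
    llm = [5, 4, 2, 3, 5, 1]
    human = [4, 5, 3, 4, 5, 2]
    print(f"agreement: {agreement_rate(llm, human):.1%}")
```

Comparing decisions rather than raw grades means two graders who give a 4 and a 5 still count as agreeing, which matches the thumbs-up or thumbs-down framing in the abstract.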
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Information Retrieval and Search Behavior
Methods: Attention Is All You Need · Weight Decay · WordPiece · Softmax · Layer Normalization · Linear Warmup With Linear Decay · Byte Pair Encoding · Attention Dropout · Dropout
