Evaluating Quality of Answers for Retrieval-Augmented Generation: A Strong LLM Is All You Need

Yang Wang; Alberto Garcia Hernandez; Roman Kyslyi; Nicholas Kersting

arXiv:2406.18064 · cs.CL · November 8, 2024 · 3 cites

TL;DR

This paper introduces vRAG-Eval, a grading system for assessing answer quality in Retrieval-Augmented Generation, and shows that GPT-4's accept-or-reject evaluations agree closely with human expert judgments, especially in factual, closed-domain contexts.

Contribution

The paper presents vRAG-Eval, a novel evaluation framework for RAG answer quality, and shows GPT-4's evaluations align closely with human experts, highlighting LLMs as effective evaluators.

Findings

01

vRAG-Eval effectively assesses correctness, completeness, and honesty.

02

GPT-4 reaches 83% agreement with human experts on accept-or-reject decisions.

03

LLMs can serve as reliable evaluators where human evaluation is resource-intensive.

Abstract

We present a comprehensive study of answer quality evaluation in Retrieval-Augmented Generation (RAG) applications using vRAG-Eval, a novel grading system designed to assess correctness, completeness, and honesty. We further map the grades for these quality aspects to a binary score indicating an accept or reject decision, mirroring the intuitive "thumbs-up" or "thumbs-down" gesture commonly used in chat applications. This approach suits factual business contexts where a clear decision is essential. Our assessment applies vRAG-Eval to two Large Language Models (LLMs), evaluating the quality of answers generated by a vanilla RAG application. We compare these evaluations with human expert judgments and find substantial alignment between GPT-4's assessments and those of human experts, reaching 83% agreement on accept or reject decisions. This study highlights…
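
The two steps the abstract describes, mapping quality grades to an accept or reject decision and then measuring agreement with human experts, can be sketched as below. This is a minimal illustration, not the paper's implementation: the abstract does not state the grading scale or the mapping rule, so the 1-to-5 scale, the `ACCEPT_THRESHOLD` value, the "weakest aspect decides" rule, and the names `Grade`, `to_decision`, and `agreement_rate` are all assumptions made for the example. The sample grades are invented as well.

```python
from dataclasses import dataclass

# Assumption: grades run from 1 (worst) to 5 (best) and an answer is accepted
# when every aspect scores at least 4. The abstract specifies neither.
ACCEPT_THRESHOLD = 4


@dataclass(frozen=True)
class Grade:
    """Hypothetical per-answer grade over the three aspects named in the abstract."""
    correctness: int
    completeness: int
    honesty: int


def to_decision(grade: Grade, threshold: int = ACCEPT_THRESHOLD) -> bool:
    """Map a grade to accept (True) or reject (False).

    Assumed rule: the weakest aspect decides, so a single low score rejects.
    """
    weakest = min(grade.correctness, grade.completeness, grade.honesty)
    return weakest >= threshold


def agreement_rate(decisions_a: list[bool], decisions_b: list[bool]) -> float:
    """Fraction of answers on which two evaluators made the same decision."""
    if len(decisions_a) != len(decisions_b):
        raise ValueError("both evaluators must judge the same answers")
    if not decisions_a:
        raise ValueError("at least one decision is required")
    matches = sum(a == b for a, b in zip(decisions_a, decisions_b))
    return matches / len(decisions_a)


if __name__ == "__main__":
    # Invented data for illustration only; not results from the paper.
    llm_grades = [Grade(5, 5, 5), Grade(4, 5, 4), Grade(2, 4, 5), Grade(5, 3, 5)]
    human_decisions = [True, True, False, True]

    llm_decisions = [to_decision(g) for g in llm_grades]
    print(f"agreement: {agreement_rate(llm_decisions, human_decisions):.0%}")
```

Raw percent agreement is the simplest way to compare two sets of binary decisions, and it is the kind of figure the 83% in the abstract reports. It does not correct for agreement expected by chance, so a chance-corrected statistic may be worth reporting alongside it when accept and reject decisions are unevenly distributed.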

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Information Retrieval and Search Behavior

Methods: Attention Is All You Need · Weight Decay · WordPiece · Softmax · Layer Normalization · Linear Warmup With Linear Decay · Byte Pair Encoding · Attention Dropout · Dropout