GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering

Sacha Muller; António Loison; Bilel Omrani; Gautier Viaud

arXiv:2409.06595 · cs.CL · January 31, 2025

TL;DR

This paper introduces GroUSE, a meta-evaluation benchmark of 144 unit tests for judge models in grounded question answering. It shows that existing automated evaluation frameworks overlook important failure modes and proposes improvements, including fine-tuning a judge model for better evaluation accuracy.

Contribution

The paper presents GroUSE, a comprehensive benchmark with unit tests for evaluating judge models, and demonstrates that fine-tuning Llama-3 enhances evaluation performance.

Findings

01. Existing automated RAG evaluation frameworks often overlook important failure modes, even with GPT-4 as the judge.

02. Correlation with GPT-4's judgments is an incomplete proxy for a judge model's practical performance.

03. Fine-tuning Llama-3 on GPT-4's reasoning traces improves its evaluation accuracy.

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a common paradigm to use Large Language Models (LLMs) alongside private and up-to-date knowledge bases. In this work, we address the challenges of using LLM-as-a-Judge when evaluating grounded answers generated by RAG systems. To assess the calibration and discrimination capabilities of judge models, we identify 7 generator failure modes and introduce GroUSE (Grounded QA Unitary Scoring of Evaluators), a meta-evaluation benchmark of 144 unit tests. This benchmark reveals that existing automated RAG evaluation frameworks often overlook important failure modes, even when using GPT-4 as a judge. To improve on the current design of automated RAG evaluation frameworks, we propose a novel pipeline and find that while closed models perform well on GroUSE, state-of-the-art open-source judges do not generalize to our proposed criteria,…
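
As a rough illustration of what a unit-test style meta-evaluation of a judge could look like, the sketch below scores a judge by the fraction of test cases where its rating falls inside an expected range. It is only a minimal sketch under stated assumptions: the `UnitTest` fields, the 1 to 5 score scale, the `pass_rate` helper, and the `toy_judge` stand-in are all hypothetical and do not reproduce the GroUSE schema, metrics, or the illuin-tech/grouse API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UnitTest:
    """Hypothetical test case: a grounded QA sample plus the score range
    a well-calibrated judge is expected to assign (assumed 1-5 scale)."""
    question: str
    references: List[str]
    answer: str
    min_score: int
    max_score: int


# A judge maps (question, references, answer) to an integer score.
Judge = Callable[[str, List[str], str], int]


def pass_rate(judge: Judge, tests: List[UnitTest]) -> float:
    """Fraction of unit tests where the judge's score is in the expected range."""
    if not tests:
        raise ValueError("no unit tests provided")
    passed = sum(
        t.min_score <= judge(t.question, t.references, t.answer) <= t.max_score
        for t in tests
    )
    return passed / len(tests)


def toy_judge(question: str, references: List[str], answer: str) -> int:
    """Stand-in judge (not an LLM): top score only if the answer text
    appears verbatim in one of the references."""
    return 5 if any(answer.lower() in r.lower() for r in references) else 1


# Two illustrative cases: a grounded answer and an ungrounded one.
TESTS = [
    UnitTest("What is the capital of France?",
             ["Paris is the capital of France."], "Paris", 5, 5),
    UnitTest("What is the capital of France?",
             ["Paris is the capital of France."], "Lyon", 1, 2),
]

print(pass_rate(toy_judge, TESTS))
```

In practice the judge would be a call to an LLM with an evaluation prompt. A judge that rates every answer highly would fail the second case here, which is the kind of failure a pass-rate check is meant to expose.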

Code & Models

Repositories

illuin-tech/grouse
Official

Datasets

illuin/grouse
dataset · 168 downloads

Taxonomy

Topics: Expert finding and Q&A systems · Topic Modeling

Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Layer Normalization · Dropout · Position-Wise Feed-Forward Layer · WordPiece