GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering
Sacha Muller, António Loison, Bilel Omrani, Gautier Viaud

TL;DR
This paper introduces GroUSE, a benchmark for evaluating the performance of judge models in grounded question answering, revealing current limitations and proposing improvements including fine-tuning models for better evaluation accuracy.
Contribution
The paper presents GroUSE, a comprehensive benchmark with unit tests for evaluating judge models, and demonstrates that fine-tuning Llama-3 enhances evaluation performance.
Findings
Existing evaluation frameworks often miss critical failure modes.
Correlation with GPT-4's judgments is not fully indicative of a judge model's practical performance.
Fine-tuning Llama-3 on GPT-4's reasoning traces improves evaluation accuracy.
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a common paradigm to use Large Language Models (LLMs) alongside private and up-to-date knowledge bases. In this work, we address the challenges of using LLM-as-a-Judge when evaluating grounded answers generated by RAG systems. To assess the calibration and discrimination capabilities of judge models, we identify 7 generator failure modes and introduce GroUSE (Grounded QA Unitary Scoring of Evaluators), a meta-evaluation benchmark of 144 unit tests. This benchmark reveals that existing automated RAG evaluation frameworks often overlook important failure modes, even when using GPT-4 as a judge. To improve on the current design of automated RAG evaluation frameworks, we propose a novel pipeline and find that while closed models perform well on GroUSE, state-of-the-art open-source judges do not generalize to our proposed criteria,…
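The idea of unit-testing a judge can be illustrated with a minimal sketch. It does not reproduce the GroUSE data format or the authors' pipeline: the `JudgeUnitTest` class, the score bounds, the example texts, and the toy `overlap_judge` are all hypothetical, chosen only to show how a pass rate over unit tests can expose a judge that overlooks a failure mode.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: a judge maps (question, contexts, answer) to a score in [0, 1].
Judge = Callable[[str, List[str], str], float]


@dataclass
class JudgeUnitTest:
    """One meta-evaluation case: an answer plus the score range a sound judge should give."""
    name: str
    question: str
    contexts: List[str]
    answer: str
    min_score: float  # inclusive lower bound on an acceptable score
    max_score: float  # inclusive upper bound on an acceptable score


def run_unit_tests(judge: Judge, tests: List[JudgeUnitTest]) -> Dict[str, bool]:
    """Score every test answer with the judge and record whether it lands in range."""
    results = {}
    for t in tests:
        score = judge(t.question, t.contexts, t.answer)
        results[t.name] = t.min_score <= score <= t.max_score
    return results


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of unit tests the judge passed."""
    return sum(results.values()) / len(results)


def overlap_judge(question: str, contexts: List[str], answer: str) -> float:
    """Toy stand-in for an LLM judge: fraction of answer words found in the contexts."""
    context_words = set(" ".join(contexts).lower().split())
    answer_words = answer.lower().split()
    if not answer_words:
        return 0.0
    return sum(w in context_words for w in answer_words) / len(answer_words)


# Illustrative cases only (assumption): one grounded answer, one with an unsupported claim.
CONTEXTS = ["the bridge was opened in 1932"]
TESTS = [
    JudgeUnitTest("grounded_answer", "When was the bridge opened?", CONTEXTS,
                  "the bridge was opened in 1932", min_score=0.9, max_score=1.0),
    JudgeUnitTest("unsupported_claim", "When was the bridge opened?", CONTEXTS,
                  "the bridge was opened in 1950 by the king", min_score=0.0, max_score=0.5),
]

results = run_unit_tests(overlap_judge, TESTS)
print(results, pass_rate(results))
```

In this sketch the toy judge passes the grounded case but scores the unsupported answer too highly, because most of its words still appear in the context. A real judge model would replace `overlap_judge`, and a real benchmark would supply many more cases per failure mode.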
Taxonomy
Topics: Expert finding and Q&A systems · Topic Modeling
Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Layer Normalization · Dropout · Position-Wise Feed-Forward Layer · WordPiece
