Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation

Gauthier Guinet; Behrooz Omidvar-Tehrani; Anoop Deoras; Laurent Callot

arXiv:2405.13622 · cs.CL · May 24, 2024 · 6 cites

TL;DR

This paper introduces an automated, cost-effective method using synthetic exams and Item Response Theory to evaluate and improve retrieval-augmented language models' task-specific accuracy across diverse datasets.

Contribution

The authors develop a novel evaluation framework combining synthetic exam generation with IRT to assess and enhance RAG models' performance in a task-specific manner.

Findings

01

The choice of retrieval algorithm affects RAG performance more than model size does.

02

The proposed method effectively identifies informative exam questions for model evaluation.

03

The evaluation yields insights into the factors that affect RAG performance, such as the retrieval mechanism and the prompting strategy.

Abstract

We propose a new method to measure the task-specific accuracy of Retrieval-Augmented Large Language Models (RAG). Evaluation is performed by scoring the RAG on an automatically generated synthetic exam composed of multiple-choice questions based on the corpus of documents associated with the task. Our method is an automated, cost-efficient, interpretable, and robust strategy to select the optimal components for a RAG system. We leverage Item Response Theory (IRT) to estimate the quality of an exam and its informativeness on task-specific accuracy. IRT also provides a natural way to iteratively improve the exam by eliminating the exam questions that are not sufficiently informative about a model's ability. We demonstrate our approach on four new open-ended Question-Answering tasks based on arXiv abstracts, StackExchange questions, AWS DevOps troubleshooting guides, and SEC filings. In…
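
To make the IRT-based pruning step in the abstract concrete, here is a minimal sketch. It assumes a two-parameter logistic (2PL) item model, which is a standard IRT formulation but may not be the exact model the authors use; the item names, parameter values, ability grid, and the `min_info` threshold are all hypothetical and chosen only for illustration. Each question's Fisher information measures how well it separates models of different ability, and questions whose information never reaches the threshold are dropped.

```python
import math


def p_correct(theta, a, b):
    """2PL probability that a model with ability theta answers the item correctly.

    a is the item's discrimination, b its difficulty (both assumed already fitted).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def prune_exam(items, thetas, min_info):
    """Keep the items whose peak information over the ability grid reaches min_info."""
    kept = []
    for name, a, b in items:
        peak = max(item_information(t, a, b) for t in thetas)
        if peak >= min_info:
            kept.append(name)
    return kept


# Hypothetical fitted parameters: (question id, discrimination a, difficulty b).
items = [("q1", 1.8, 0.0), ("q2", 0.2, 0.5), ("q3", 1.2, -1.0)]

# Ability grid from -3.0 to 3.0 in steps of 0.1 (an assumed range).
thetas = [x / 10.0 for x in range(-30, 31)]

kept = prune_exam(items, thetas, min_info=0.1)
print(kept)
```

Under the 2PL model an item's information peaks at `a**2 / 4` when ability equals difficulty, so the weakly discriminating `q2` is the one removed here. In practice the parameters would be estimated from the responses of many RAG configurations to the exam rather than set by hand.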

Code & Models

Repositories

amazon-science/auto-rag-eval (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Attention Is All You Need · WordPiece · Linear Warmup With Linear Decay · Attention Dropout · Linear Layer · Multi-Head Attention · Residual Connection · Weight Decay · Byte Pair Encoding