Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation
Gauthier Guinet, Behrooz Omidvar-Tehrani, Anoop Deoras, Laurent Callot

TL;DR
This paper introduces an automated, cost-effective method using synthetic exams and Item Response Theory to evaluate and improve retrieval-augmented language models' task-specific accuracy across diverse datasets.
Contribution
The authors develop a novel evaluation framework combining synthetic exam generation with IRT to assess and enhance RAG models' performance in a task-specific manner.
Findings
The choice of retrieval algorithm affects RAG performance more than model size does.
The proposed method effectively identifies informative exam questions for model evaluation.
The analysis yields insights into factors affecting RAG performance, such as the retrieval mechanism and prompting strategies.
Abstract
We propose a new method to measure the task-specific accuracy of Retrieval-Augmented Large Language Models (RAG). Evaluation is performed by scoring the RAG on an automatically generated synthetic exam composed of multiple-choice questions based on the corpus of documents associated with the task. Our method is an automated, cost-efficient, interpretable, and robust strategy to select the optimal components for a RAG system. We leverage Item Response Theory (IRT) to estimate the quality of an exam and its informativeness on task-specific accuracy. IRT also provides a natural way to iteratively improve the exam by eliminating the exam questions that are not sufficiently informative about a model's ability. We demonstrate our approach on four new open-ended Question-Answering tasks based on arXiv abstracts, StackExchange questions, AWS DevOps troubleshooting guides, and SEC filings. In…
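The abstract's idea of using IRT to drop uninformative questions can be illustrated with a minimal sketch. It assumes a standard three-parameter logistic (3PL) item model with discrimination `a`, difficulty `b`, and guessing `c`; the abstract does not state which IRT variant the authors use, so the model choice, the function names, and the `keep_fraction` parameter are all hypothetical and chosen only for illustration. Fitting the item parameters from model responses is out of scope here; the sketch takes them as given.

```python
import math

def p_correct(theta, a, b, c):
    """3PL probability that a model with ability theta answers the item correctly."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p_correct(theta, a, b, c)
    return a ** 2 * ((p - c) / (1.0 - c)) ** 2 * ((1.0 - p) / p)

def prune_items(items, thetas, keep_fraction=0.75):
    """Keep the most informative items, averaged over the given abilities.

    items: list of (a, b, c) tuples (assumed already estimated).
    thetas: ability values of the evaluated models (assumed already estimated).
    keep_fraction: hypothetical knob, share of items to retain.
    """
    def mean_info(item):
        return sum(item_information(t, *item) for t in thetas) / len(thetas)

    ranked = sorted(items, key=mean_info, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]

# Illustrative values only: four-choice questions, so guessing c = 0.25.
example_items = [
    (1.5, 0.0, 0.25),   # discriminative, medium difficulty
    (0.2, 0.0, 0.25),   # barely discriminative
    (1.0, 4.0, 0.25),   # far too hard for the models below
    (1.2, -0.5, 0.25),  # discriminative, slightly easy
]
example_thetas = [-1.0, 0.0, 1.0]
kept = prune_items(example_items, example_thetas, keep_fraction=0.5)
```

In this toy setup the two items retained are the ones whose difficulty lies near the models' abilities and whose discrimination is high. Items that nearly every model gets right or wrong, or that barely separate strong from weak models, carry little information and are pruned first.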
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Attention Is All You Need · WordPiece · Linear Warmup With Linear Decay · Attention Dropout · Linear Layer · Multi-Head Attention · Residual Connection · Weight Decay · Byte Pair Encoding
