Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems

Rafael Teixeira de Lima (1); Shubham Gupta (1); Cesar Berrospi (2); Lokesh Mishra (2); Michele Dolfi (2); Peter Staar (2); Panagiotis Vagenas (2); ((1) IBM Research Paris-Saclay; (2) IBM Research Zurich)

arXiv:2411.19710·cs.IR·December 2, 2024·2 cites

Open Access

TL;DR

This paper addresses the challenges of evaluating RAG systems by proposing label-based dataset characterization and label-targeted data generation, and shows that small fine-tuned LLMs can generate quality Q&A datasets for system assessment.

Contribution

The paper introduces a taxonomy for RAG datasets, identifies issues with current data generation methods, and proposes label-based dataset characterization and fine-tuned LLMs for dataset creation.

Findings

01

Public Q&A datasets can mislead RAG performance evaluation.

02

Common dataset generation tools can produce unbalanced data.

03

Fine-tuned small LLMs can generate effective Q&A datasets.

Abstract

Retrieval Augmented Generation (RAG) systems are a widespread application of Large Language Models (LLMs) in the industry. While many tools exist empowering developers to build their own systems, measuring their performance locally, with datasets reflective of the system's use cases, is a technological challenge. Solutions to this problem range from non-specific and cheap (most public datasets) to specific and costly (generating data from local documents). In this paper, we show that using public question and answer (Q&A) datasets to assess retrieval performance can lead to non-optimal systems design, and that common tools for RAG dataset generation can lead to unbalanced data. We propose solutions to these issues based on the characterization of RAG datasets through labels and through label-targeted data generation. Finally, we show that fine-tuned small LLMs can efficiently generate…
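As a rough illustration of what characterizing a dataset through labels might look like in practice, the sketch below measures how evenly labels are distributed across a Q&A dataset. It is a minimal sketch under stated assumptions, not the authors' method: the record layout, the `label` field, the label names `type_a` and `type_b`, and the helper functions are all hypothetical and do not come from the paper.

```python
from collections import Counter


def label_distribution(records, label_key="label"):
    """Return each label's share of the dataset as a fraction in [0, 1]."""
    counts = Counter(record[label_key] for record in records)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute a distribution for an empty dataset")
    return {label: n / total for label, n in counts.items()}


def imbalance_ratio(distribution):
    """Ratio of the largest label share to the smallest; 1.0 means balanced."""
    shares = list(distribution.values())
    return max(shares) / min(shares)


# Hypothetical toy dataset. The label names are placeholders,
# not the taxonomy defined in the paper.
sample = [
    {"question": "q1", "label": "type_a"},
    {"question": "q2", "label": "type_a"},
    {"question": "q3", "label": "type_a"},
    {"question": "q4", "label": "type_b"},
]

dist = label_distribution(sample)
ratio = imbalance_ratio(dist)
print(dist, ratio)
```

A ratio well above 1.0 would suggest that some question types dominate the dataset, which is the kind of imbalance the abstract attributes to common generation tools.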

Taxonomy

Topics: Context-Aware Activity Recognition Systems · Quality and Safety in Healthcare

Methods: Attention Is All You Need · Weight Decay · Linear Warmup With Linear Decay · Linear Layer · Layer Normalization · WordPiece · Attention Dropout · Multi-Head Attention · Byte Pair Encoding