Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework
Zackary Rackauckas, Arthur Câmara, Jakub Zavrel

TL;DR
This paper introduces RAGElo, an automated Elo-based framework leveraging LLMs for evaluating and ranking Retrieval-Augmented Generation systems in domain-specific QA tasks, addressing hallucination and benchmarking challenges.
Contribution
It presents a comprehensive, LLM-based evaluation framework for RAG systems, including synthetic data generation, LLM-as-judge, and Elo ranking, specifically applied to domain-specific QA.
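To make the Elo-ranking step concrete, here is a minimal sketch of how pairwise judge verdicts could be turned into ratings. It uses the textbook Elo update, not necessarily the exact procedure in RAGElo. The K-factor of 32, the starting rating of 1000, the function names, and the sample verdicts are all illustrative assumptions and do not come from the paper.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of agent A against agent B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def run_elo(games, k: float = 32.0, initial: float = 1000.0) -> dict:
    """Apply the textbook Elo update to a sequence of pairwise games.

    Each game is (agent_a, agent_b, score_a), where score_a is 1.0 if the
    judge preferred A's answer, 0.0 if it preferred B's, and 0.5 for a tie.
    k and initial are assumed values, not settings taken from the paper.
    """
    ratings = defaultdict(lambda: initial)
    for agent_a, agent_b, score_a in games:
        exp_a = expected_score(ratings[agent_a], ratings[agent_b])
        # Zero-sum update: whatever A gains, B loses.
        delta = k * (score_a - exp_a)
        ratings[agent_a] += delta
        ratings[agent_b] -= delta
    return dict(ratings)


# Hypothetical judge verdicts for illustration only (not results from the paper).
games = [
    ("RAGF", "RAG", 1.0),  # judge preferred the RAGF answer
    ("RAGF", "RAG", 0.5),  # judge called it a tie
    ("RAG", "RAGF", 0.0),  # judge preferred the RAGF answer
]
ratings = run_elo(games)
for agent, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{agent}: {rating:.1f}")
```

Because Elo updates depend on the order of games, a practical tournament would typically shuffle the games and average ratings over several runs. That refinement is omitted here for brevity.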
Findings
RAG-Fusion (RAGF) outperforms standard RAG in Elo score and answer completeness.
RAGElo's rankings correlate positively with human judgments.
RAGF produces more complete and relevant answers.
Abstract
Challenges in the automated evaluation of Retrieval-Augmented Generation (RAG) Question-Answering (QA) systems include hallucination problems in domain-specific knowledge and the lack of gold standard benchmarks for company-internal tasks. This results in difficulties in evaluating RAG variations, like RAG-Fusion (RAGF), in the context of a product QA task at Infineon Technologies. To solve these problems, we propose a comprehensive evaluation framework, which leverages Large Language Models (LLMs) to generate large datasets of synthetic queries based on real user queries and in-domain documents, uses LLM-as-a-judge to rate retrieved documents and answers, evaluates the quality of answers, and ranks different variants of RAG agents with RAGElo's automated Elo-based competition. LLM-as-a-judge rating of a random sample of synthetic queries shows a…
Taxonomy
Topics: Inertial Sensor and Navigation
Methods: Attention Is All You Need · Weight Decay · WordPiece · Softmax · Layer Normalization · Linear Warmup With Linear Decay · Byte Pair Encoding · Attention Dropout · Dropout
