RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering

Rujun Han; Yuhao Zhang; Peng Qi; Yumo Xu; Jenyuan Wang; Lan Liu; William Yang Wang; Bonan Min; Vittorio Castelli

arXiv:2407.13998·cs.CL·October 4, 2024

TL;DR

This paper introduces RAG-QA Arena, a new benchmark for evaluating domain robustness in long-form retrieval-augmented question answering, addressing limitations of existing datasets by including diverse, long-form answers across multiple domains.

Contribution

The paper creates Long-form RobustQA, a comprehensive dataset with human-written answers from multiple domains, and proposes RAG-QA Arena for direct evaluation of RAG-QA systems using LLMs as evaluators.

Findings

01

RAG-QA Arena correlates well with human judgment.

02

Only 41.3% of the most competitive LLM's answers are preferred to LFRQA's answers.

03

The dataset covers 26K queries across seven domains.

Abstract

Question answering based on retrieval augmented generation (RAG-QA) is an important research topic in NLP and has a wide range of real-world applications. However, most existing datasets for this task are either constructed using a single source corpus or consist of short extractive answers, which fall short of evaluating large language model (LLM) based RAG-QA systems on cross-domain generalization. To address these limitations, we create Long-form RobustQA (LFRQA), a new dataset comprising human-written long-form answers that integrate short extractive answers from multiple documents into a single, coherent narrative, covering 26K queries and large corpora across seven different domains. We further propose RAG-QA Arena by directly comparing model-generated answers against LFRQA's answers using LLMs as evaluators. We show via extensive experiments that RAG-QA Arena and human judgments…
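The pairwise setup described above reduces to a simple aggregate: each query yields one evaluator verdict (model answer preferred, LFRQA answer preferred, or tie), and the reported figure is the share of verdicts that favor the model. The following is a minimal sketch of that aggregation only, not the paper's evaluation code. The function name `preference_rates` and the labels `"model"`, `"reference"`, and `"tie"` are hypothetical, and the toy verdicts are made up for illustration.

```python
from collections import Counter

# Hypothetical verdict labels; the paper's evaluator may use a different scheme.
VALID_VERDICTS = {"model", "reference", "tie"}


def preference_rates(verdicts):
    """Return the share of comparisons won by the model, won by the reference, and tied."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    unknown = set(verdicts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    counts = Counter(verdicts)
    total = len(verdicts)
    # Counter returns 0 for labels that never occur, so every label gets a rate.
    return {label: counts[label] / total for label in sorted(VALID_VERDICTS)}


if __name__ == "__main__":
    # Toy data: one verdict per query, as an LLM evaluator might emit them.
    toy = ["model", "reference", "reference", "tie"]
    print(preference_rates(toy))
```

Under this reading, a "model" rate of 0.413 would correspond to the 41.3% figure listed in the findings, assuming ties are counted separately from wins.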

Code & Models

Repositories

awslabs/rag-qa-arena (Official)

Videos

RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering

Taxonomy

Topics: Topic Modeling · Expert finding and Q&A systems · Speech and dialogue systems