RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
Joao Monteiro, Pierre-Andre Noel, Etienne Marcotte, Sai Rajeswar, Valentina Zantedeschi, David Vazquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian

TL;DR
RepLiQA is a new dataset designed to evaluate large language models on question-answering tasks using unseen, human-crafted reference documents, reducing the risk of data leakage and providing a more accurate assessment of model capabilities.
Contribution
The paper introduces RepLiQA, a novel benchmark dataset with unseen reference content, enabling more reliable evaluation of LLMs on question-answering and topic retrieval tasks.
Findings
State-of-the-art LLMs show varied performance on RepLiQA.
Models perform better when relevant content is explicitly provided.
RepLiQA reduces data leakage issues in model evaluation.
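The second finding contrasts answering with and without the reference document in the prompt. A minimal sketch of how such a comparison could be set up is shown here. It is not the authors' evaluation code: the field names (`document`, `question`, `answer`), the prompt wording, the exact-match metric, and the stand-in `toy_model` are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional


def build_prompt(question: str, document: Optional[str] = None) -> str:
    """Return a closed-book prompt, or a context-conditional one if a document is given."""
    if document is None:
        return f"Question: {question}\nAnswer:"
    return f"Document:\n{document}\n\nQuestion: {question}\nAnswer:"


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.lower().split())


def exact_match_accuracy(
    samples: List[Dict[str, str]],
    answer_fn: Callable[[str], str],
    use_document: bool,
) -> float:
    """Fraction of samples whose normalized prediction equals the normalized answer."""
    hits = 0
    for sample in samples:
        document = sample["document"] if use_document else None
        prediction = answer_fn(build_prompt(sample["question"], document))
        hits += normalize(prediction) == normalize(sample["answer"])
    return hits / len(samples)


# Hypothetical sample about an imaginary event (not taken from RepLiQA).
samples = [
    {
        "document": "The Harbor Gazette reported that the Lumen Bridge opened in 2031.",
        "question": "When did the Lumen Bridge open?",
        "answer": "2031",
    }
]


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM call: it can only answer if the fact is in the prompt."""
    return "2031" if "2031" in prompt else "unknown"


with_context = exact_match_accuracy(samples, toy_model, use_document=True)
closed_book = exact_match_accuracy(samples, toy_model, use_document=False)
print(f"with document: {with_context:.2f}, closed book: {closed_book:.2f}")
```

Because the reference documents describe imaginary scenarios, a model that has not seen them can only succeed in the closed-book setting by guessing, which is what makes the gap between the two settings informative. In practice `toy_model` would be replaced by a real model call, and exact match by whatever scoring the evaluation actually uses.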
Abstract
Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet. This data includes encyclopedic documents that harbor a vast amount of general knowledge (e.g., Wikipedia) but also potentially overlap with benchmark datasets used for evaluating LLMs. Consequently, evaluating models on test splits that might have leaked into the training set is prone to misleading conclusions. To foster sound evaluation of language models, we introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a collection of five splits of test sets, four of which have not been released to the internet or exposed to LLM APIs prior to this publication. Each sample in RepLiQA comprises (1) a reference document crafted by a human annotator and depicting an imaginary scenario (e.g., a news article) absent…
Taxonomy
Topics: Natural Language Processing Techniques · Library Science and Information Systems · Biomedical Text Mining and Ontologies
Methods: Sparse Evolutionary Training
