RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content

Joao Monteiro; Pierre-Andre Noel; Etienne Marcotte; Sai Rajeswar; Valentina Zantedeschi; David Vazquez; Nicolas Chapados; Christopher Pal; Perouz Taslakian

arXiv:2406.11811 · cs.CL · November 6, 2024

TL;DR

RepLiQA is a new dataset designed to evaluate large language models on question-answering tasks using unseen, human-crafted reference documents, reducing the risk of data leakage and providing a more accurate assessment of model capabilities.

Contribution

The paper introduces RepLiQA, a novel benchmark dataset with unseen reference content, enabling more reliable evaluation of LLMs on question-answering and topic retrieval tasks.

Findings

01 State-of-the-art LLMs show varied performance on RepLiQA.

02 Models perform better when relevant content is explicitly provided.

03 RepLiQA reduces data leakage issues in model evaluation.

Abstract

Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet. This data includes encyclopedic documents that harbor a vast amount of general knowledge (e.g., Wikipedia) but also potentially overlap with benchmark datasets used for evaluating LLMs. Consequently, evaluating models on test splits that might have leaked into the training set is prone to misleading conclusions. To foster sound evaluation of language models, we introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a collection of five splits of test sets, four of which have not been released to the internet or exposed to LLM APIs prior to this publication. Each sample in RepLiQA comprises (1) a reference document crafted by a human annotator and depicting an imaginary scenario (e.g., a news article) absent…
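As a rough illustration of the question-answering setup the abstract describes, the sketch below pairs a reference document with a question, builds a prompt with or without the document, and scores a prediction. Everything named here is an assumption made for illustration: the `QASample` field names, the prompt layout, and the example text are hypothetical, and token-level F1 is a generic QA metric swapped in for demonstration, not necessarily the metric used in the paper.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class QASample:
    # Hypothetical field names; the released dataset may use different ones.
    document: str
    question: str
    answer: str


def build_prompt(sample: QASample, with_context: bool = True) -> str:
    """Build a QA prompt, optionally prepending the reference document."""
    parts = []
    if with_context:
        parts.append(f"Document:\n{sample.document}")
    parts.append(f"Question: {sample.question}")
    parts.append("Answer:")
    return "\n\n".join(parts)


def _tokens(text: str) -> list[str]:
    # Lowercase and split on word characters for a crude normalization.
    return re.findall(r"\w+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred, ref = _tokens(prediction), _tokens(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up example in the spirit of an imaginary news article.
sample = QASample(
    document="The town of Velmora opened a new library in March.",
    question="When did Velmora open its new library?",
    answer="In March",
)
prompt_with_doc = build_prompt(sample, with_context=True)
prompt_without_doc = build_prompt(sample, with_context=False)
```

Comparing scores for answers produced from `prompt_with_doc` and `prompt_without_doc` is one simple way to check whether a model relies on the provided document or on memorized content.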

Code & Models

Repositories

ServiceNow/repliqa
Official

Datasets

ServiceNow/repliqa
dataset · 4.6k downloads

Videos

RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Library Science and Information Systems · Biomedical Text Mining and Ontologies

Methods: Sparse Evolutionary Training