NeoQA: Evidence-based Question Answering with Generated News Events

Max Glockner; Xiang Jiang; Leonardo F. R. Ribeiro; Iryna Gurevych; Markus Dreyer

arXiv:2505.05949·cs.CL·May 12, 2025

TL;DR

NeoQA is a novel benchmark for evaluating evidence-based question answering in large language models, using fictional news events to ensure models rely solely on retrieved evidence rather than pretraining knowledge.

Contribution

The paper introduces NeoQA, a new dataset and benchmark for controlled, evidence-based evaluation of LLMs. It uses generated fictional news data so that pretraining knowledge cannot influence the answers.

Findings

1. LLMs struggle with subtle mismatches between questions and evidence.
2. Models exhibit shortcut reasoning when key evidence is missing.
3. NeoQA provides a controlled environment for evaluating evidence reliance.

Abstract

Evaluating Retrieval-Augmented Generation (RAG) in large language models (LLMs) is challenging because benchmarks can quickly become stale. Questions initially requiring retrieval may become answerable from pretraining knowledge as newer models incorporate more recent information during pretraining, making it difficult to distinguish evidence-based reasoning from recall. We introduce NeoQA (News Events for Out-of-training Question Answering), a benchmark designed to address this issue. To construct NeoQA, we generated timelines and knowledge bases of fictional news events and entities along with news articles and Q&A pairs to prevent LLMs from leveraging pretraining knowledge, ensuring that no prior evidence exists in their training data. We propose our dataset as a new platform for evaluating evidence-based question answering, as it requires LLMs to generate responses exclusively from…

Code & Models

Repositories

amazon-science/neoqa (official)

Datasets

mglockner/neoqa
dataset · 95 downloads

Videos

NeoQA: Evidence-based Question Answering with Generated News Events · Underline

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Computational and Text Analysis Methods