Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection

Federica Gamba; Aman Sinha; Timothee Mickus; Raul Vazquez; Patanjali Bhamidipati; Claudio Savelli; Ahana Chattopadhyay; Laura A. Zanella; Yash Kankanampati; Binesh Arakkal Remesh; Aryan Ashok Chandramania; Rohit Agarwal; Chuyuan Li; Ioana Buhnila; Radhika Mamidi

arXiv:2510.22395·cs.CL·October 28, 2025

TL;DR

The CAP dataset provides a multilingual, scientifically focused benchmark for detecting hallucinations in large language models, aiding research in factual accuracy and multilingual NLP in scientific contexts.

Contribution

This paper introduces the CAP dataset, a large, multilingual collection of scientific question-answer pairs with annotations for hallucinations and fluency, specifically designed for scientific LLM evaluation.

Findings

01. 900 curated scientific questions and 7000+ LLM answers

02. Annotations for factuality errors and fluency issues

03. Cross-lingual coverage with 9 languages

Abstract

We introduce the CAP (Confabulations from ACL Publications) dataset, a multilingual resource for studying hallucinations in large language models (LLMs) within scientific text generation. CAP focuses on the scientific domain, where hallucinations can distort factual knowledge, as they frequently do. In this domain, however, the presence of specialized terminology, statistical reasoning, and context-dependent interpretations further exacerbates these distortions, particularly given LLMs' lack of true comprehension, limited contextual understanding, and bias toward surface-level generalization. CAP operates in a cross-lingual setting covering five high-resource languages (English, French, Hindi, Italian, and Spanish) and four low-resource languages (Bengali, Gujarati, Malayalam, and Telugu). The dataset comprises 900 curated scientific questions and over 7000 LLM-generated answers from 16…
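
To make the dataset description concrete, here is a minimal Python sketch of how one annotated answer might be represented and how factuality-error rates could be tallied per language. The record layout, the field names, the ISO 639-1 language codes, and the helper `error_rate_by_language` are all assumptions made for illustration; they are not the released CAP schema, and the toy records are invented.

```python
from collections import defaultdict
from dataclasses import dataclass

# Language groups as listed in the abstract; the ISO 639-1 codes are our choice.
HIGH_RESOURCE = {"en", "fr", "hi", "it", "es"}
LOW_RESOURCE = {"bn", "gu", "ml", "te"}


@dataclass(frozen=True)
class CapRecord:
    """Hypothetical record layout; field names are assumptions, not the CAP schema."""
    question: str
    answer: str
    language: str           # ISO 639-1 code
    model: str              # name of the LLM that produced the answer
    factuality_error: bool  # True if the answer is annotated as hallucinated
    fluency_issue: bool     # True if the answer is annotated as disfluent


def error_rate_by_language(records):
    """Return {language: fraction of answers flagged with a factuality error}."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for r in records:
        totals[r.language] += 1
        flagged[r.language] += int(r.factuality_error)
    return {lang: flagged[lang] / totals[lang] for lang in totals}


# Toy records invented for illustration only; they are not drawn from CAP.
toy = [
    CapRecord("q1", "a1", "en", "model-a", False, False),
    CapRecord("q1", "a2", "en", "model-b", True, False),
    CapRecord("q2", "a3", "te", "model-a", True, True),
]
rates = error_rate_by_language(toy)
```

Keeping the two labels as separate fields mirrors the description above, where factuality errors and fluency issues are annotated independently, so an answer can be fluent yet hallucinated or the reverse.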
