How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms

JV Roig

arXiv:2603.08274·cs.CL·March 10, 2026

Open Access

TL;DR

Using a deterministic, ground-truth-based evaluation method, this study quantifies hallucination rates in large language models during document question answering. It finds that fabrication increases with context length and varies by model, temperature, and hardware platform.

Contribution

Introduces RIKER, a ground-truth-based evaluation approach for deterministic, scalable measurement of hallucinations in LLMs across diverse models and hardware platforms.

Findings

01

Hallucination rates increase with context length, reaching over 10% at 200K tokens.

02

Model choice affects hallucination resistance more than model size or temperature does.

03

Grounding ability and fabrication resistance are distinct, with some models excelling at one but not the other.

Abstract

How much do large language models actually hallucinate when answering questions grounded in provided documents? Despite the critical importance of this question for enterprise AI deployments, reliable measurement has been hampered by benchmarks that rely on static datasets vulnerable to contamination, LLM-based judges with documented biases, or evaluation scales too small for statistical confidence. We address this gap using RIKER, a ground-truth-first evaluation methodology that enables deterministic scoring without human annotation. Across 35 open-weight models, three context lengths (32K, 128K, and 200K tokens), four temperature settings, and three hardware platforms (NVIDIA H200, AMD MI300X, and Intel Gaudi 3), we conducted over 172 billion tokens of evaluation - an order of magnitude beyond prior work. Our findings reveal that: (1) even the best-performing models fabricate answers…
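The abstract describes deterministic scoring against known ground truth, and the findings separate grounding ability from fabrication resistance. The sketch below illustrates how two such metrics could be computed without an LLM judge. It is not RIKER's implementation: the `Item` record, the `REFUSAL_MARKERS` list, and the substring-matching rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    """One evaluated question (hypothetical record layout)."""
    question: str
    expected: Optional[str]  # None means the documents do not contain an answer
    response: str            # the model's answer text


# Assumed phrases that count as declining to answer; a real harness
# would need a stricter, documented response format.
REFUSAL_MARKERS = ("not found", "not stated", "cannot be determined")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching is deterministic."""
    return " ".join(text.lower().split())


def is_refusal(response: str) -> bool:
    """True if the response contains any assumed refusal phrase."""
    r = normalize(response)
    return any(marker in r for marker in REFUSAL_MARKERS)


def score(items: list[Item]) -> dict[str, float]:
    """Compute grounding accuracy and fabrication rate separately."""
    answerable = [i for i in items if i.expected is not None]
    unanswerable = [i for i in items if i.expected is None]

    # Grounding: the known answer string appears in the response.
    correct = sum(normalize(i.expected) in normalize(i.response) for i in answerable)
    # Fabrication: the model answers although no answer exists in the documents.
    fabricated = sum(not is_refusal(i.response) for i in unanswerable)

    return {
        "grounding_accuracy": correct / len(answerable) if answerable else 0.0,
        "fabrication_rate": fabricated / len(unanswerable) if unanswerable else 0.0,
    }


sample = [
    Item("Capital of France?", "Paris", "The capital is Paris."),
    Item("Invoice total?", "42", "It is 41."),
    Item("Who is the CEO?", None, "Not found in the documents."),
    Item("Founding year?", None, "The company was founded in 1999."),
]
result = score(sample)
print(result)
```

Keeping the two rates separate means a model that answers answerable questions well but also invents answers to unanswerable ones shows up as high on both metrics, rather than having the two effects averaged into a single score.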

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Machine Learning in Materials Science