Measuring the Impact of Lexical Training Data Coverage on Hallucination Detection in Large Language Models

Shuo Zhang; Fabrizio Gotti; Fengran Mo; Jian-Yun Nie

arXiv:2511.17946 · cs.CL · November 25, 2025

TL;DR

This paper investigates whether lexical coverage of training data can serve as an additional signal for detecting hallucinations in large language models, showing modest improvements when combined with existing methods.

Contribution

It introduces a scalable method to analyze lexical data coverage and evaluates its effectiveness as a hallucination detection signal across multiple benchmarks.

Findings

01 Lexical coverage features provide a modest detection signal.

02 Combining coverage features with log-probabilities improves detection.

03 Coverage features are more effective on datasets with higher model uncertainty.

Abstract

Hallucination in large language models (LLMs) is a fundamental challenge, particularly in open-domain question answering. Prior work attempts to detect hallucination with model-internal signals such as token-level entropy or generation consistency, while the connection between pretraining data exposure and hallucination remains underexplored. Existing studies show that LLMs underperform on long-tail knowledge, i.e., the accuracy of the generated answer drops when the ground-truth entities are rare in the pretraining data. However, whether data coverage itself can serve as a detection signal has been overlooked. We pose a complementary question: does lexical training-data coverage of the question and/or the generated answer provide an additional signal for hallucination detection? To investigate this, we construct scalable suffix arrays over RedPajama's 1.3-trillion-token pretraining corpus to…
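To make the idea of suffix-array lookups concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the function names (`build_suffix_array`, `count_ngram`, `ngram_coverage`), the whitespace tokenization, and the "fraction of n-grams found in the corpus" feature are all assumptions made for illustration, and the naive construction would not scale anywhere near a trillion-token corpus.

```python
def build_suffix_array(tokens):
    # Sort suffix start positions by the token sequence that follows each one.
    # Naive construction: acceptable for a toy corpus only.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, suffix_array, ngram, upper):
    # Binary search over the sorted suffixes, comparing only the first
    # len(ngram) tokens of each suffix against the query n-gram.
    lo, hi = 0, len(suffix_array)
    n = len(ngram)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        prefix = tokens[start:start + n]
        if prefix < ngram or (upper and prefix == ngram):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_ngram(tokens, suffix_array, ngram):
    # Number of occurrences of `ngram` in the corpus: the width of the
    # block of suffixes that start with it.
    ngram = list(ngram)
    first = _bound(tokens, suffix_array, ngram, upper=False)
    last = _bound(tokens, suffix_array, ngram, upper=True)
    return last - first


def ngram_coverage(tokens, suffix_array, text_tokens, n):
    # Hypothetical coverage feature: the fraction of the text's n-grams
    # that occur at least once in the corpus.
    ngrams = [text_tokens[i:i + n] for i in range(len(text_tokens) - n + 1)]
    if not ngrams:
        return 0.0
    found = sum(1 for g in ngrams if count_ngram(tokens, suffix_array, g) > 0)
    return found / len(ngrams)


corpus = "the cat sat on the mat the cat ran".split()
sa = build_suffix_array(corpus)
print(count_ngram(corpus, sa, ["the", "cat"]))
print(ngram_coverage(corpus, sa, "the cat sat on a mat".split(), 2))
```

Each lookup costs a logarithmic number of comparisons once the array is built, which is the property that makes counting n-grams over a very large corpus feasible. A score like the one returned by `ngram_coverage` could then be fed to a detector alongside other signals such as log-probabilities.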

Taxonomy

Topics: Topic Modeling · Mental Health via Writing · Text Readability and Simplification