When Bias Pretends to Be Truth: How Spurious Correlations Undermine Hallucination Detection in LLMs

Shaowen Wang; Yiqi Dong; Ruinian Chang; Tansheng Zhu; Yuebo Sun; Kaifeng Lyu; Jian Li

arXiv:2511.07318 · cs.CL · November 24, 2025

Open Access · 3 Reviews

TL;DR

This paper reveals that spurious correlations in training data cause hallucinations in large language models, which current detection methods fail to identify, highlighting the need for new mitigation strategies.

Contribution

It uncovers the impact of spurious correlations on hallucination detection in LLMs and provides systematic experiments and theoretical insights demonstrating their resilience against existing defenses.

Findings

01. Spurious correlations induce confidently generated hallucinations.

02. Current detection methods fail against spurious correlation-driven hallucinations.

03. Spurious biases persist even after fine-tuning and scaling.

Abstract

Despite substantial advances, large language models (LLMs) continue to exhibit hallucinations, generating plausible yet incorrect responses. In this paper, we highlight a critical yet previously underexplored class of hallucinations driven by spurious correlations -- superficial but statistically prominent associations between features (e.g., surnames) and attributes (e.g., nationality) present in the training data. We demonstrate that these spurious correlations induce hallucinations that are confidently generated, immune to model scaling, evade current detection methods, and persist even after refusal fine-tuning. Through systematically controlled synthetic experiments and empirical evaluations on state-of-the-art open-source and proprietary LLMs (including GPT-5), we show that existing hallucination detection methods, such as confidence-based filtering and inner-state probing,…
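
To make the failure mode concrete, here is a minimal toy sketch of why a confidence threshold can miss a hallucination driven by a spurious correlation. It is not the paper's experimental setup: the corpus, the count-based `predict` function, and the 0.9 threshold are all illustrative assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (surname, nationality) pairs.
# The surname is a superficial feature that co-occurs strongly with one attribute.
train = [("Rossi", "Italian")] * 19 + [("Rossi", "Argentine")] * 1

# Count how often each nationality appears with each surname.
counts = defaultdict(Counter)
for surname, nationality in train:
    counts[surname][nationality] += 1


def predict(surname):
    """Return the most frequent nationality for a surname and its empirical probability."""
    dist = counts[surname]
    label, freq = dist.most_common(1)[0]
    return label, freq / sum(dist.values())


def passes_confidence_filter(confidence, threshold=0.9):
    """Assumed confidence-based filter: accept answers at or above the threshold."""
    return confidence >= threshold


# A specific person named Rossi whose true nationality is the minority class.
true_nationality = "Argentine"
prediction, confidence = predict("Rossi")

is_hallucination = prediction != true_nationality
undetected = is_hallucination and passes_confidence_filter(confidence)
```

In this toy case the predictor answers from the surname alone, so the wrong answer carries the same high confidence as a correct one and the filter has nothing to separate them.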

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. The paper focuses on a targeted research question, and does a good job of providing an in-depth discussion.
2. The theoretical discussion and the controlled experiments with synthetic data across a wide range of models both strongly support the hypothesis proposed in the paper.
3. The paper is well written, and I enjoyed reading it. The Related Work is mostly well done, the Theoretical discussion is broadly easy to follow (although it can be made more accessible), and the experiment details a

Weaknesses

1. My only complaint is that the 'real world validation' is not as strong, which leaves open the question of how to actually recognize spurious correlations in the wild. It is not clear to me how the Jaccard similarity of entities between the question and the generated answer is a measure of spurious correlations. While I do believe this is a good paper, even without the real world validation, clarification of these experiments would be appreciated. In their current form, I'm not convinced these res
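
For reference, the Jaccard similarity the reviewer mentions is the size of the intersection of two sets divided by the size of their union. A minimal sketch, assuming the entities have already been extracted as strings; the function name, the example entities, and the empty-set convention are illustrative and not taken from the paper:

```python
def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections of hashable items."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty sets score 0.0.
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical entity sets for a question and a generated answer.
question_entities = {"Marie Curie", "Poland"}
answer_entities = {"Marie Curie", "France"}

# One shared entity out of three distinct entities overall.
score = jaccard_similarity(question_entities, answer_entities)
```

The score only measures entity overlap; whether that overlap reflects a spurious correlation is the question the reviewer raises.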

Reviewer 02 · Rating 4 · Confidence 2

Strengths

This paper provides a rigorous theoretical derivation supporting their hypothesis. They propose well-controlled synthetic experiments and pair them with evaluation on real LLMs. The paper demonstrates a consistent empirical trend showing the degradation of detection under strong spurious correlations. The paper identifies an important failure mode of hallucination detection in large language models (LLMs), attributing it to spurious correlations in training data. It provides a rigorous theoretical anal

Weaknesses

My main concern is regarding the novelty and practical applicability of the paper. The core problem (that spurious correlations can lead models to make incorrect predictions) is well established in prior robustness and fairness literature. While this paper contextualizes the issue within LLM hallucination detection and LLM confidence, the theoretical analysis is based on a kernel regression formulation that is too simplified to capture the dynamics of modern LLMs. Moreover, the theoretical bo

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The problem is important: The paper focuses on spurious correlations, an overlooked source of hallucinations, and exposes the fundamental limitations of existing detection methods, demonstrating strong theoretical and practical significance.
2. Support for reproducibility: Detailed experimental settings, data construction, model configurations, and prompt templates are provided, facilitating reproduction and validation in future research.

Weaknesses

1. Limited experimental scale and model coverage: Although multiple models are included, the experiments mainly focus on small- to medium-scale models (e.g., 1.7B parameters), with insufficient systematic evaluation on larger models (e.g., tens to hundreds of billions of parameters). It is recommended to include results for larger-scale models.
2. Narrow definition of spurious correlations: Currently, spurious correlations are introduced only via surname-attribute mappings, whereas in reality,

Taxonomy

Topics: Mental Health via Writing · Adversarial Robustness in Machine Learning · Misinformation and Its Impacts