Where Fake Citations Are Made: Tracing Field-Level Hallucination to Specific Neurons in LLMs
Yuefei Chen, Yihao Quan, Xiaodong Lin, Ruixiang Tang

TL;DR
This paper investigates the neuron-level basis of citation hallucinations in large language models, identifying specific neurons responsible for generating fictitious references and demonstrating methods to mitigate this issue.
Contribution
The study reveals field-specific hallucination neurons in LLMs and proposes a neuron-level intervention approach to reduce citation hallucinations.
Findings
Author names are hallucinated far more often than other citation fields, across all models and settings.
Field-specific hallucination neurons can be identified and manipulated.
Amplifying these neurons increases hallucinations; suppressing them reduces hallucinations.
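The amplify/suppress finding can be illustrated with a minimal sketch. The summary does not say how the intervention is implemented, so multiplicative scaling of activations is an assumption here, and `scale_neurons`, the toy activation vector, and the chosen indices are hypothetical names and values used only for illustration.

```python
def scale_neurons(activations, neuron_ids, factor):
    """Return a copy of `activations` with the chosen neurons multiplied by `factor`.

    Assumption: the intervention is a plain multiplicative rescaling.
    factor > 1 amplifies the selected neurons, 0 <= factor < 1 suppresses them.
    """
    targets = set(neuron_ids)
    return [a * factor if i in targets else a for i, a in enumerate(activations)]


# Toy activation vector for one layer; indices 0 and 2 stand in for FH-neurons.
hidden = [0.5, -1.0, 2.0]
amplified = scale_neurons(hidden, [0, 2], 2.0)   # doubles neurons 0 and 2
suppressed = scale_neurons(hidden, [0, 2], 0.0)  # zeroes neurons 0 and 2
print(amplified, suppressed)
```

In a real model the same rescaling would have to be applied to the relevant layer's activations during the forward pass; how that is wired up depends on the framework and is not covered by this sketch.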
Abstract
LLMs frequently generate fictitious yet convincing citations, often expressing high confidence even when the underlying reference is wrong. We study this failure across 9 models and 108,000 generated references, and find that author names fail far more often than other fields across all models and settings. Citation style has no measurable effect, while reasoning-oriented distillation degrades recall. Probes trained on one field transfer at near-chance levels to the others, suggesting that hallucination signals do not generalize across fields. Building on this finding, we apply elastic-net regularization with stability selection to neuron-level CETT values of Qwen2.5-32B-Instruct and identify a sparse set of field-specific hallucination neurons (FH-neurons). Causal intervention further confirms their role: amplifying these neurons increases hallucination, while suppressing them…
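The selection step named in the abstract, elastic-net regularization combined with stability selection, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's pipeline: it uses a linear-regression elastic net on synthetic Gaussian features rather than CETT values and hallucination labels, and the penalty strength, number of resamples, and 0.7 selection threshold are hypothetical choices.

```python
import random


def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold` (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_fit(X, y, alpha=0.3, l1_ratio=0.5, n_iter=50):
    """Coordinate descent for the objective
    (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||^2.
    No intercept is fitted; features are assumed roughly centered."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # residual = y - Xw, and w starts at zero
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(c * c for c in col) / n
            # Covariance of feature j with the residual that excludes j's own term.
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, residual)) / n
            new_wj = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_wj - w[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                w[j] = new_wj
    return w


def stability_selection(X, y, n_resamples=30, threshold=0.7, seed=0):
    """Refit on random half-samples and keep features whose coefficient is
    nonzero in at least `threshold` of the fits."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_resamples):
        idx = rng.sample(range(n), n // 2)
        w = elastic_net_fit([X[i] for i in idx], [y[i] for i in idx])
        for j in range(p):
            if w[j] != 0.0:
                counts[j] += 1
    freqs = [c / n_resamples for c in counts]
    selected = [j for j, f in enumerate(freqs) if f >= threshold]
    return selected, freqs


# Synthetic stand-in data: 6 "neurons", only the first two drive the target.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in range(200)]
y = [2.0 * row[0] - 2.0 * row[1] + data_rng.gauss(0, 0.1) for row in X]
selected, freqs = stability_selection(X, y)
print(selected, freqs)
```

The point of the resampling loop is that a feature kept by a single sparse fit may be an artifact of that particular sample, whereas a feature kept across most half-samples is a more stable candidate. That is the sense in which the resulting set is both sparse and robust.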
