CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation
Zhao Tong, Chunlin Gong, Yiping Zhang, Haichao Shi, Qiang Liu, Xingcheng Xu, Shu Wu, Xiao-Yu Zhang

TL;DR
This paper reveals that LLMs can internally generate unsafe narratives during fake news creation even when they refuse harmful requests, highlighting the need for deeper safety analysis beyond final outputs.
Contribution
It introduces a unified safety-analysis framework that examines internal reasoning layers and attention heads to identify unsafe patterns in LLMs during fake news generation.
Findings
Unsafe reasoning can persist internally despite refusal responses.
Critical attention heads responsible for unsafe divergence are concentrated in mid-depth layers.
Activating reasoning mode significantly increases generation risk.
Abstract
From generating headlines to fabricating news, Large Language Models (LLMs) are typically assessed by their final outputs, under the safety assumption that a refusal response signifies safe reasoning throughout the entire process. Challenging this assumption, our study reveals that during fake news generation, even when a model rejects a harmful request, its Chain-of-Thought (CoT) reasoning may still internally contain and propagate unsafe narratives. To analyze this phenomenon, we introduce a unified safety-analysis framework that systematically deconstructs CoT generation across model layers and evaluates the role of individual attention heads through Jacobian-based spectral metrics. Within this framework, we introduce three interpretable measures (stability, geometry, and energy) to quantify how specific attention heads respond to or embed deceptive reasoning patterns. Extensive…
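The abstract does not define its stability, geometry, and energy measures, so the following is only a minimal sketch of what a Jacobian-based spectral summary of one attention head could look like. Everything in it is an assumption made for illustration: the toy single-head attention, the finite-difference Jacobian, the helper names (`attention_head`, `head_jacobian`, `spectral_summary`), and the three quantities reported (spectral norm, squared Frobenius norm, and an entropy-based effective rank) are generic stand-ins, not the paper's definitions.

```python
import numpy as np


def softmax(z):
    # Row-wise softmax with max subtraction for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def attention_head(x, wq, wk, wv):
    # Toy single-head self-attention on a (tokens, d_model) input (assumed setup).
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v


def head_jacobian(x, wq, wk, wv, eps=1e-5):
    # Central-difference Jacobian of the flattened head output
    # with respect to the flattened input.
    flat = x.ravel()
    out_dim = attention_head(x, wq, wk, wv).size
    jac = np.zeros((out_dim, flat.size))
    for i in range(flat.size):
        step = np.zeros_like(flat)
        step[i] = eps
        plus = attention_head((flat + step).reshape(x.shape), wq, wk, wv).ravel()
        minus = attention_head((flat - step).reshape(x.shape), wq, wk, wv).ravel()
        jac[:, i] = (plus - minus) / (2 * eps)
    return jac


def spectral_summary(jac):
    # Illustrative proxies only (hypothetical, not the paper's metrics):
    #   spectral_norm  -> largest singular value (worst-case sensitivity)
    #   energy         -> sum of squared singular values (squared Frobenius norm)
    #   effective_rank -> exp of the entropy of the normalized squared spectrum
    s = np.linalg.svd(jac, compute_uv=False)
    energy = float(np.sum(s ** 2))
    p = s ** 2 / energy
    p = p[p > 0]
    return {
        "spectral_norm": float(s[0]),
        "energy": energy,
        "effective_rank": float(np.exp(-np.sum(p * np.log(p)))),
    }


# Usage on random data (shapes chosen arbitrarily for the sketch).
rng = np.random.default_rng(0)
tokens, d_model, d_head = 4, 6, 3
x = rng.normal(size=(tokens, d_model))
wq, wk, wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
jac = head_jacobian(x, wq, wk, wv)
summary = spectral_summary(jac)
```

In a real model the Jacobian would more likely come from automatic differentiation than from finite differences; the finite-difference version is used here only to keep the sketch dependency-light.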
Taxonomy
Topics: Misinformation and Its Impacts · Topic Modeling · AI in Service Interactions
