# When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment

**Authors:** Hanqi Yan, Hainiu Xu, Siya Qi, Shu Yang, Yulan He

arXiv: 2509.00544 · 2026-03-11

## TL;DR

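This paper investigates Reasoning-Induced Misalignment (RIM) in large language models, in which safety alignment degrades as reasoning capabilities are strengthened. It traces the effect to specific attention heads at inference time and to entanglement between reasoning and safety activations in safety-critical neurons during training.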

## Contribution

The paper presents what the authors describe as the first mechanistic account of RIM, linking attention-head behavior and neuron-level activation entanglement to reasoning-induced safety failures.

## Key findings

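- Specific attention heads facilitate refusal by reducing their attention to Chain-of-Thought (CoT) tokens during inference.
- Activation entanglement between reasoning and safety is significantly higher in safety-critical neurons than in control neurons, particularly after fine-tuning on the identified reasoning patterns.
- This entanglement strongly correlates with catastrophic forgetting, giving a neuron-level explanation for RIM.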
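
To make the first finding concrete, the sketch below computes, for each attention head, the share of one query token's attention that lands on CoT tokens. It is an illustration only, not the paper's procedure: the function names `cot_attention_share` and `per_head_cot_share`, the toy weights, and the choice of CoT positions are all assumptions introduced here.

```python
def cot_attention_share(attn_row, cot_positions):
    """Fraction of one query position's attention mass that falls on CoT tokens.

    attn_row: non-negative attention weights over key positions (hypothetical input).
    cot_positions: indices of the key positions that belong to the CoT span.
    """
    total = sum(attn_row)
    if total == 0:
        return 0.0
    return sum(attn_row[i] for i in cot_positions) / total


def per_head_cot_share(attn_rows_by_head, cot_positions):
    """Apply cot_attention_share to one attention row per head."""
    return [cot_attention_share(row, cot_positions) for row in attn_rows_by_head]


# Toy example (made-up numbers): two heads, four key positions,
# positions 1 and 2 are assumed to be the CoT tokens.
toy_heads = [
    [0.1, 0.2, 0.3, 0.4],  # head 0 spreads attention, half of it on CoT
    [0.7, 0.1, 0.1, 0.1],  # head 1 mostly ignores the CoT span
]
shares = per_head_cot_share(toy_heads, [1, 2])
for head, share in enumerate(shares):
    print(f"head {head}: CoT attention share = {share:.2f}")
```

Under this reading, a head whose share drops on prompts the model refuses, relative to prompts it answers, would be a candidate for the refusal-facilitating heads the paper describes.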

## Abstract

With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this paper, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges when reasoning capabilities are strengthened, particularly when specific types of reasoning patterns are introduced during inference or training. Beyond reporting this vulnerability, we provide the first mechanistic account of its origins. Through representation analysis, we discover that specific attention heads facilitate refusal by reducing their attention to CoT tokens, a mechanism that modulates the model's rationalization process during inference. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons than in control neurons, particularly after fine-tuning with those identified reasoning patterns. This entanglement strongly correlates with catastrophic forgetting, providing a neuron-level explanation for RIM.
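
This summary does not state how the paper measures activation entanglement. As a stand-in, the sketch below uses cosine similarity between a neuron group's activations on reasoning inputs and on safety inputs, then compares a safety-critical group against a control group. The metric, the name `entanglement_proxy`, the neuron indices, and the activation values are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def entanglement_proxy(reasoning_acts, safety_acts, neuron_ids):
    """Assumed proxy: similarity of a neuron group's mean activations on the two input types."""
    u = [reasoning_acts[i] for i in neuron_ids]
    v = [safety_acts[i] for i in neuron_ids]
    return cosine(u, v)


# Made-up mean activations for four neurons on each input type.
reasoning_acts = [1.0, 2.0, 0.0, 1.0]
safety_acts = [2.0, 4.0, 1.0, 0.0]

safety_critical = [0, 1]  # hypothetical safety-critical neuron indices
control = [2, 3]          # hypothetical control neuron indices

print("safety-critical:", entanglement_proxy(reasoning_acts, safety_acts, safety_critical))
print("control:", entanglement_proxy(reasoning_acts, safety_acts, control))
```

In this toy setup the safety-critical group scores higher than the control group, which mirrors the direction of the reported finding without reproducing the paper's measurement.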

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00544/full.md

## Figures

40 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00544/full.md

## References

48 references — full list in the complete paper: https://tomesphere.com/paper/2509.00544/full.md

---
Source: https://tomesphere.com/paper/2509.00544