Conflicts Make Large Reasoning Models Vulnerable to Attacks

Honghao Liu; Chengjin Xu; Xuhui Jiang; Cehao Yang; Shengming Yin; Zhengwu Ma; Lionel Ni; Jian Guo

arXiv:2604.09750 · cs.CR · April 14, 2026

TL;DR

This paper shows that large reasoning models become more vulnerable to harmful queries when a prompt places them under conflicting objectives, which points to a need for stronger alignment strategies.

Contribution

The paper analyzes conflict-induced vulnerabilities in LRMs and introduces a new dataset and evaluation framework for assessing their robustness.

Findings

01 Conflicts significantly increase attack success rates.

02 Layerwise and neuron-level analyses show that safety-related representations shift under conflict.

03 Even single-round, non-narrative prompts, with no sophisticated auto-attack techniques, can cause models to behave unsafely.

Abstract

Large Reasoning Models (LRMs) have achieved remarkable performance across diverse domains, yet their decision-making under conflicting objectives remains insufficiently understood. This work investigates how LRMs respond to harmful queries when confronted with two categories of conflicts: internal conflicts that pit alignment values against each other and dilemmas, which impose mutually contradictory choices, including sacrificial, duress, agent-centered, and social forms. Using over 1,300 prompts across five benchmarks, we evaluate three representative LRMs - Llama-3.1-Nemotron-8B, QwQ-32B, and DeepSeek R1 - and find that conflicts significantly increase attack success rates, even under single-round non-narrative queries without sophisticated auto-attack techniques. Our findings reveal through layerwise and neuron-level analyses that safety-related and functional representations shift…
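
The abstract's central metric, attack success rate, can be illustrated with a minimal sketch. It assumes each model response has already been judged harmful or not by some external evaluator; the function name, the condition labels ("plain", "conflict"), and the toy judgments are all hypothetical and are not taken from the paper or its repository.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return {condition: fraction of responses judged harmful}.

    `records` is an iterable of (condition, judged_harmful) pairs.
    This is an illustrative definition, not the paper's evaluation code.
    """
    totals = defaultdict(int)
    harmful = defaultdict(int)
    for condition, judged_harmful in records:
        totals[condition] += 1
        harmful[condition] += int(judged_harmful)
    return {c: harmful[c] / totals[c] for c in totals}


# Made-up judgments for illustration only (NOT results from the paper).
toy_records = [
    ("plain", False), ("plain", False), ("plain", True), ("plain", False),
    ("conflict", True), ("conflict", False), ("conflict", True), ("conflict", True),
]

rates = attack_success_rate(toy_records)
for condition, rate in rates.items():
    print(f"{condition}: {rate:.2f}")
```

Comparing the rate for conflict-framed prompts against the rate for the same harmful queries without a conflict is one simple way to express the kind of increase the abstract reports; how the paper actually judges harmfulness is not stated in this excerpt.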

Code & Models

Repositories

DataArcTech/ConflictHarm (GitHub)
