Reasoning Hijacking: The Fragility of Reasoning Alignment in Large Language Models

Yuansen Liu; Yixuan Tang; Anthony Kum Hoe Tun

arXiv:2601.10294·cs.CR·April 28, 2026

Reasoning Hijacking: The Fragility of Reasoning Alignment in Large Language Models

Yuansen Liu, Yixuan Tang, Anthony Kum Hoe Tun

PDF

1 Repo

TL;DR

This paper reveals a new vulnerability in large language models called Reasoning Hijacking, where models are manipulated to make decisions based on spurious criteria without changing their high-level goals.

Contribution

It introduces the Criteria Attack, demonstrating how current alignment techniques are fragile against adversarial prompts that manipulate reasoning logic.

Findings

01

Models are highly susceptible to spurious reasoning shortcuts.

02

Current defenses fail to detect attacks that preserve the high-level goal.

03

State-of-the-art models can be manipulated without goal deviation.

Abstract

Current LLM safety research predominantly focuses on mitigating Goal Hijacking, preventing attackers from redirecting a model's high-level objective (e.g., from "summarizing emails" to "phishing users"). In this paper, we argue that this perspective is incomplete and highlight a critical vulnerability in Reasoning Alignment. We expose the inherent fragility of current alignment techniques by proposing a new adversarial prompt attack paradigm: Reasoning Hijacking. To demonstrate this vulnerability, we instantiate it via the Criteria Attack, which subverts model judgments by injecting spurious decision criteria without altering the high-level task goal. Unlike Goal Hijacking, which attempts to override the system prompt, Reasoning Hijacking keeps the task goal intact but manipulates the model's decision-making logic by injecting spurious reasoning shortcuts. Through extensive experiments…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Yuan-Hou/criteria_attack
github

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.