TL;DR
This paper reveals a new vulnerability in large language models called Reasoning Hijacking, where models are manipulated to make decisions based on spurious criteria without changing their high-level goals.
Contribution
It introduces the Criteria Attack, demonstrating how current alignment techniques are fragile against adversarial prompts that manipulate reasoning logic.
Findings
Models are highly susceptible to spurious reasoning shortcuts.
Current defenses fail to detect attacks that preserve the high-level goal.
State-of-the-art models can be manipulated without goal deviation.
Abstract
Current LLM safety research predominantly focuses on mitigating Goal Hijacking, preventing attackers from redirecting a model's high-level objective (e.g., from "summarizing emails" to "phishing users"). In this paper, we argue that this perspective is incomplete and highlight a critical vulnerability in Reasoning Alignment. We expose the inherent fragility of current alignment techniques by proposing a new adversarial prompt attack paradigm: Reasoning Hijacking. To demonstrate this vulnerability, we instantiate it via the Criteria Attack, which subverts model judgments by injecting spurious decision criteria without altering the high-level task goal. Unlike Goal Hijacking, which attempts to override the system prompt, Reasoning Hijacking keeps the task goal intact but manipulates the model's decision-making logic by injecting spurious reasoning shortcuts. Through extensive experiments…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
