Consistency of Large Reasoning Models Under Multi-Turn Attacks
Yubo Li, Ramayya Krishnan, Rema Padman

TL;DR
This paper evaluates the robustness of large reasoning models against multi-turn adversarial attacks, revealing that reasoning improves but does not guarantee resilience, and identifying specific failure modes and challenges for confidence-based defenses.
Contribution
It systematically assesses reasoning models under adversarial pressure, uncovers distinct vulnerability profiles, and demonstrates the limitations of existing confidence-based defense methods.
Findings
Reasoning models outperform baselines but remain vulnerable to specific attacks.
Five failure modes identified, with Self-Doubt and Social Conformity accounting for half of failures.
Confidence-Aware Response Generation fails for reasoning models due to overconfidence issues.
Abstract
Large reasoning models with reasoning capabilities achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. We evaluate nine frontier reasoning models under adversarial attacks. Our findings reveal that reasoning confers meaningful but incomplete robustness: most reasoning models studied significantly outperform instruction-tuned baselines, yet all exhibit distinct vulnerability profiles, with misleading suggestions universally effective and social pressure showing model-specific efficacy. Through trajectory analysis, we identify five failure modes (Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue) with the first two accounting for 50% of failures. We further demonstrate that Confidence-Aware Response Generation (CARG), effective for standard LLMs, fails…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
