TL;DR
This paper investigates how multimodal large reasoning models (MLRMs) can be manipulated through emotional cues to produce unsafe outputs, revealing significant safety vulnerabilities and proposing metrics to quantify these risks.
Contribution
The paper introduces EmoAgent, an autonomous adversarial emotion-agent framework that exploits emotional susceptibility in MLRMs, and proposes three metrics to quantify emotional and reasoning safety risks.
Findings
EmoAgent effectively hijacks reasoning pathways in MLRMs.
Models can still produce harmful completions even when they correctly identify visual risks.
Transparent deep-thinking scenarios show persistent high-risk failure modes, such as harmful reasoning masked behind seemingly safe responses.
Abstract
We observe that MLRMs oriented toward human-centric service are highly susceptible to user emotional cues during the deep-thinking stage, often overriding safety protocols or built-in safety checks under high emotional intensity. Inspired by this key insight, we propose EmoAgent, an autonomous adversarial emotion-agent framework that orchestrates exaggerated affective prompts to hijack reasoning pathways. Even when visual risks are correctly identified, models can still produce harmful completions through emotional misalignment. We further identify persistent high-risk failure modes in transparent deep-thinking scenarios, such as MLRMs generating harmful reasoning masked behind seemingly safe responses. These failures expose misalignments between internal inference and surface-level behavior, eluding existing content-based safeguards. To quantify these risks, we introduce three metrics:…