Conceal, Reconstruct, Jailbreak: Exploiting the Reconstruction-Concealment Tradeoff in MLLMs
Md Farhamdur Reza, Richeng Jin, Tianfu Wu, and Huaiyu Dai

TL;DR
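This paper analyzes the tradeoff that intent-obfuscation jailbreak attacks on multimodal large language models (MLLMs) face: the transformed input must conceal harmful intent from safety filters while remaining recoverable enough for the victim model to reconstruct the original request. It then proposes strategies that exploit this tradeoff.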
Contribution
It introduces concealment-aware variant construction, together with distractor images, to better balance concealment and reconstruction, exposing vulnerabilities in MLLMs.
Findings
Existing transformations struggle to balance concealment and reconstruction.
Character-removed variants improve the concealment-reconstruction tradeoff.
Proposed strategies outperform baselines in revealing harmful intent.
Abstract
Intent-obfuscation-based jailbreak attacks on multimodal large language models (MLLMs) transform a harmful query into a concealed multimodal input to bypass safety mechanisms. We show that such attacks are governed by a reconstruction-concealment tradeoff: the transformed input must hide harmful intent from safety filters while remaining recoverable enough for the victim model to reconstruct the original request. Through a reconstruction analysis of three representative black-box methods, we find that existing transformations struggle to balance this tradeoff, limiting their effectiveness. In contrast, we show that character-removed variants achieve a better balance. Building on this, we propose concealment-aware variant construction, which greedily selects character-removed variants that are low in harmful-keyword alignment and mutually diverse, and instantiates them…
