SUA: Stealthy Multimodal Large Language Model Unlearning Attack
Xianren Zhang, Hui Liu, Delvin Ce Zhang, Xianfeng Tang, Qi He, Dongwon Lee, Suhang Wang

TL;DR
This paper introduces SUA, a novel stealthy attack framework that can recover unlearned sensitive information from multimodal large language models by using a universal, semantically subtle noise pattern.
Contribution
The paper proposes a new unlearning attack method that effectively reveals forgotten knowledge in MLLMs using a universal, stealthy noise pattern with embedding alignment, highlighting vulnerabilities in current unlearning defenses.
Findings
SUA can successfully recover unlearned sensitive information.
A single perturbation generalizes to unseen images, revealing forgotten content.
The attack remains semantically unnoticeable due to embedding alignment.
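The combination of a single shared perturbation and an embedding alignment term can be illustrated with a toy sketch. This is not the paper's implementation: the summary does not give the actual objective, so everything below is an assumption made for illustration. The "model" is a hypothetical scalar score `tanh(v . x)` standing in for how strongly the forgotten answer is produced, the "image encoder" is a hypothetical linear map `W`, and the names `recall_score`, `embedding_shift`, `objective`, and `learn_universal_noise` are invented here. One noise vector `delta` is optimized over several training images, kept inside a pixel budget `eps`, and penalized for moving the embedding.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def recall_score(v, image, delta):
    # Toy stand-in (assumption) for how strongly the model reveals forgotten content.
    return math.tanh(dot(v, [p + d for p, d in zip(image, delta)]))

def embedding_shift(W, delta):
    # Squared distance between perturbed and clean embeddings under a linear
    # encoder W (assumption); for a linear encoder it depends only on delta.
    return sum(dot(row, delta) ** 2 for row in W)

def objective(v, W, images, delta, lam):
    # Mean recall over all images minus the weighted alignment penalty.
    mean_score = sum(recall_score(v, x, delta) for x in images) / len(images)
    return mean_score - lam * embedding_shift(W, delta)

def learn_universal_noise(v, W, images, eps=0.1, lam=1.0, lr=0.05, steps=100):
    # Projected gradient ascent on one delta shared by every image.
    n = len(v)
    delta = [0.0] * n
    for _ in range(steps):
        # Gradient of mean tanh(v.(x+delta)) is mean(1 - tanh^2) * v.
        c = sum(1.0 - recall_score(v, x, delta) ** 2 for x in images) / len(images)
        Wd = [dot(row, delta) for row in W]
        grad = [
            c * v[j] - 2.0 * lam * sum(W[k][j] * Wd[k] for k in range(len(W)))
            for j in range(n)
        ]
        # Ascent step, then clip to the pixel budget [-eps, eps].
        delta = [max(-eps, min(eps, delta[j] + lr * grad[j])) for j in range(n)]
    return delta

# Hypothetical 4-pixel "images"; the encoder only sees the first two pixels.
v = [1.0, -0.5, 0.25, 0.0]
W = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
train_images = [[0.2, 0.1, 0.0, 0.3], [-0.1, 0.4, 0.2, 0.0]]
held_out = [0.0, 0.3, 0.1, 0.5]

delta = learn_universal_noise(v, W, train_images)
print("noise:", delta)
print("held-out score, clean vs. perturbed:",
      recall_score(v, held_out, [0.0] * 4), recall_score(v, held_out, delta))
```

In this toy setting the same `delta` raises the score on an image that was not used during optimization, which mirrors the generalization finding only in spirit. A real attack would replace the linear maps with the MLLM's vision encoder and language head.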
Abstract
Multimodal Large Language Models (MLLMs) trained on massive data may memorize sensitive personal information and photos, posing serious privacy risks. To mitigate this, MLLM unlearning methods have been proposed, which fine-tune MLLMs to suppress the sensitive "forget" information. However, it remains unclear whether the knowledge has been truly forgotten or is merely hidden in the model. Therefore, we study a novel problem, the LLM unlearning attack, which aims to recover the unlearned knowledge of an unlearned LLM. To this end, we propose the Stealthy Unlearning Attack (SUA), a framework that learns a universal noise pattern. When applied to input images, this noise can trigger the model to reveal unlearned content. While pixel-level perturbations may be visually subtle, they can be detected in the semantic embedding space, making such attacks vulnerable to potential…
Taxonomy
Topics: Topic Modeling
