Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation

Vaidehi Patil; Yi-Lin Sung; Peter Hase; Jie Peng; Tianlong Chen; Mohit Bansal

arXiv:2505.01456·cs.CL·May 6, 2025

TL;DR

This paper introduces a new benchmark and framework for evaluating how well multimodal large language models can forget sensitive multimodal information, addressing a critical safety concern.

Contribution

The paper presents the first comprehensive benchmark and attack-defense framework for multimodal unlearning, including a novel white-box method and an analysis of how model scale affects unlearning.

Findings

01. Multimodal attacks are more effective than text-only or image-only attacks.

02. The best defense removes answer information from the model's internal states.

03. Larger models show greater robustness after unlearning.

Abstract

LLMs trained on massive datasets may inadvertently acquire sensitive information such as personal details and potentially harmful content. This risk is further heightened in multimodal LLMs as they integrate information from multiple modalities (image and text). Adversaries can exploit this knowledge through multimodal prompts to extract sensitive details. Evaluating how effectively MLLMs can forget such information (targeted unlearning) necessitates the creation of high-quality, well-annotated image-text pairs. While prior work on unlearning has focused on text, multimodal unlearning remains underexplored. To address this gap, we first introduce a multimodal unlearning benchmark, UnLOK-VQA (Unlearning Outside Knowledge VQA), as well as an attack-and-defense framework to evaluate methods for deleting specific multimodal knowledge from MLLMs. We extend a visual question-answering dataset…

Code & Models

Repositories

vaidehi99/unlok-vqa
PyTorch · Official

Datasets

vaidehi99/UnLOK-VQA
dataset · 58 downloads

Taxonomy

Topics: Network Security and Intrusion Detection