OFFSIDE: Benchmarking Unlearning Misinformation in Multimodal Large Language Models

Hao Zheng; Zirui Pang; Ling li; Zhijie Deng; Yuhan Pu; Zhaowei Zhu; Xiaobo Xia; Jiaheng Wei

arXiv:2510.22535·cs.AI·January 6, 2026

4 Reviews

TL;DR

This paper introduces OFFSIDE, a comprehensive benchmark dataset for evaluating misinformation unlearning in multimodal large language models, addressing current limitations and revealing vulnerabilities in existing methods.

Contribution

It provides the first detailed benchmark for misinformation unlearning in MLLMs, including diverse test scenarios and analysis of current unlearning challenges.

Findings

01. Unimodal unlearning methods fail on multimodal rumors.

02. Unlearning efficacy is mainly affected by catastrophic forgetting.

03. Current methods are vulnerable to prompt attacks.

Abstract

Advances in Multimodal Large Language Models (MLLMs) intensify concerns about data privacy, making Machine Unlearning (MU), the selective removal of learned information, a critical necessity. However, existing MU benchmarks for MLLMs are limited by a lack of image diversity, potential inaccuracies, and insufficient evaluation scenarios, which fail to capture the complexity of real-world applications. To facilitate the development of MLLMs unlearning and alleviate the aforementioned limitations, we introduce OFFSIDE, a novel benchmark for evaluating misinformation unlearning in MLLMs based on football transfer rumors. This manually curated dataset contains 15.68K records for 80 players, providing a comprehensive framework with four test sets to assess forgetting efficacy, generalization, utility, and robustness. OFFSIDE supports advanced settings like selective unlearning and corrective…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- OFFSIDE introduces a real-world benchmark based on football transfer rumors, a realistic domain where misinformation and visual evidence interact.
- The inclusion of four complementary unlearning scenarios (complete, selective, corrective relearning, and unimodal) adds diverse evaluation sets, extending prior unlearning paradigms to more complex multimodal and continual learning contexts.

Weaknesses

- Certain sections (e.g., Table 1 and parts of the Introduction) are somewhat dense, making it harder for readers to immediately grasp how OFFSIDE concretely differs from prior datasets.
- While the paper introduces the “Corrective Relearning” scenario and lists four datasets (Forget Set, Retain Set, Test Set, Relearn Set), it only briefly mentions how these relate to continual learning (“simulates a continual learning framework to examine whether previously unlearned rumors can be successfully

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- S1. The paper systematically divides the unlearning problem into Complete Unlearning, Selective Unlearning, Corrective Relearning, and Unimodal Unlearning, which facilitates a more fine-grained analysis.
- S2. One of the main challenges in this field is the lack of datasets with a sufficient number of samples and the fact that most existing datasets are limited to facial images. In this regard, creating a dataset with a sufficient number of samples is highly commendable.

Weaknesses

- W1. The interpretation of the results is somewhat unclear. For example, in L382-383, the paper states, “We observe that all the baselines exhibit a performance drop (compared to the vanilla model) in both private information and shared information.” However, looking at Table 3, the Shared Information scores do not appear to decrease significantly for GD and KL. The Retain Set in Complete Unlearning also shows a similar drop. Therefore, the difference between the findings of Selective and Comp

Reviewer 03 · Rating 4 · Confidence 3

Strengths

**S1. Systematic Formulation of Evaluation Settings** This paper organizes four unlearning scenarios, i.e., Complete, Selective, Corrective, and Unimodal, within a single benchmark. Although several of these aspects have been individually explored before, presenting them together in a unified benchmark offers practical value for researchers studying multimodal unlearning.

**S2. Comprehensive Experimental Comparison** This paper provides a broad empirical comparison of representative unlearni

Weaknesses

**W1. Scope of Benchmark** The OFFSIDE benchmark focuses solely on football transfer rumors, which constitutes a narrow and specialized scenario. In contrast, other multimodal unlearning benchmarks such as MLLMU-Bench, CLEAR, and PEBench are designed around more general and socially relevant domains (e.g., person profiles, privacy, or harmful content) that better reflect practical application contexts for unlearning. The paper does not adequately explain how misinformation about football trans

Reviewer 04 · Rating 4 · Confidence 5

Strengths

1. This paper summarizes four settings of unlearning in MLLMs, which are significant contributions to the research field.
2. The use of image and text rumors in the football field is well motivated for constructing this benchmark.
3. The paper is well-written and easy to understand.
4. The unimodal unlearning methods are evaluated comprehensively across different tasks.

Weaknesses

Despite presenting a novel benchmark on MU in MLLMs, the paper has several drawbacks:
1. As a benchmark paper, it does not evaluate several multimodal methods, such as [1] [2], and so does not provide a comprehensive evaluation of unlearning methods.
2. Table 1 lists MMUBench; however, this benchmark is not mentioned in the introduction, especially in the second paragraph, where related works are introduced.
3. The football field is rather limited in the real-world setting. Some other fields
