Reasoning Multimodal Large Language Model: Data Contamination and Dynamic Evaluation

Ming Liu; Wensheng Zhang

arXiv:2506.07202 · cs.AI · June 10, 2025

TL;DR

This paper introduces a dynamic evaluation framework for multimodal large language models that assesses their true generalization ability by perturbing tasks rather than inputs, revealing overfitting and data contamination issues.

Contribution

We propose a novel task perturbation method for evaluating MLLMs, providing deeper insights into their generalization beyond static benchmarks.

Findings

01. Fine-tuning on contaminated data improves task-specific performance

02. Models overfit to single tasks and falter under task shifts

03. Dynamic evaluation reveals overfitting and data leakage issues

Abstract

Multimodal Large Language Models (MLLMs) show impressive vision-language benchmark performance, yet growing concerns about data contamination (test set exposure during training) risk masking true generalization. This concern extends to reasoning MLLMs, often fine-tuned via reinforcement learning from potentially contaminated base models. We propose a novel dynamic evaluation framework to rigorously assess MLLM generalization, moving beyond static benchmarks. Instead of perturbing inputs, we perturb the task itself. Using the same visual input, models are evaluated across a family of tasks (e.g., QA, captioning, question posing, verification) to probe diverse capabilities. This task perturbation reveals whether model performance is robust or reliant on superficial task-specific cues. Our approach is analogous to loss landscape sharpness: models overfit or contaminated for a single task…
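The task-perturbation idea described in the abstract can be illustrated with a minimal sketch: the same visual input is paired with each task in the family (QA, captioning, question posing, verification), and per-task scores are compared. This is not the authors' implementation. The prompt wording, the `Sample` fields, the `model` and `scorer` callables, and the `task_shift_gap` summary are all hypothetical names introduced here for illustration; the paper's actual prompts and metrics may differ.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Task family named in the abstract. The prompt templates are assumptions.
TASK_PROMPTS: Dict[str, str] = {
    "qa": "Answer the question about the image: {question}",
    "captioning": "Describe the image in one sentence.",
    "question_posing": "Write a question that this image can answer.",
    "verification": "Is this statement about the image true? {statement}",
}


@dataclass
class Sample:
    """One benchmark item (hypothetical schema): an image plus task-specific text."""
    image_id: str
    question: str
    statement: str


# model(image_id, prompt) -> output text; scorer(task, sample, output) -> score in [0, 1].
Model = Callable[[str, str], str]
Scorer = Callable[[str, Sample, str], float]


def evaluate_task_family(model: Model, scorer: Scorer,
                         samples: List[Sample]) -> Dict[str, float]:
    """Run every task on the same images and return the mean score per task."""
    scores: Dict[str, float] = {}
    for task, template in TASK_PROMPTS.items():
        per_sample = []
        for s in samples:
            # Unused keyword arguments are ignored by str.format.
            prompt = template.format(question=s.question, statement=s.statement)
            per_sample.append(scorer(task, s, model(s.image_id, prompt)))
        scores[task] = mean(per_sample)
    return scores


def task_shift_gap(scores: Dict[str, float], original_task: str = "qa") -> float:
    """Original-task score minus the mean score on the perturbed tasks.

    An illustrative summary statistic, not a metric taken from the paper.
    """
    others = [v for k, v in scores.items() if k != original_task]
    return scores[original_task] - mean(others)
```

Under these assumptions, a large positive gap would suggest that performance is concentrated on the original task and does not carry over when the task changes, which is the pattern the abstract associates with overfitting or contamination.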

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Artificial Intelligence in Healthcare and Education