Reasoning Multimodal Large Language Model: Data Contamination and Dynamic Evaluation
Ming Liu, Wensheng Zhang

TL;DR
This paper introduces a dynamic evaluation framework for multimodal large language models that assesses their true generalization ability by perturbing tasks rather than inputs, revealing overfitting and data contamination issues.
Contribution
We propose a task perturbation method for evaluating MLLMs: the visual input stays fixed while the task changes, which shows whether performance generalizes beyond static benchmarks.
Findings
Fine-tuning on contaminated data improves task-specific performance
Models overfit to single tasks and falter under task shifts
Dynamic evaluation reveals overfitting and data leakage issues
Abstract
Multimodal Large Language Models (MLLMs) show impressive vision-language benchmark performance, yet data contamination (test set exposure during training) is a growing concern that risks masking true generalization. This concern extends to reasoning MLLMs, which are often fine-tuned via reinforcement learning from potentially contaminated base models. We propose a novel dynamic evaluation framework to rigorously assess MLLM generalization, moving beyond static benchmarks. Instead of perturbing inputs, we perturb the task itself. Using the same visual input, models are evaluated across a family of tasks (e.g., QA, captioning, question posing, verification) to probe diverse capabilities. This task perturbation reveals whether model performance is robust or reliant on superficial task-specific cues. Our approach is analogous to loss landscape sharpness: models overfit or contaminated for a single task…
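The task perturbation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `per_task_scores` and `task_shift_gap`, the callable model interface, and the gap statistic (seen-task score minus the mean score on the other tasks) are all assumptions made for illustration, since the excerpt does not specify the paper's actual metric.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

# Assumed interfaces (hypothetical, not from the paper):
#   model(sample, task)   -> model output for one visual input under one task
#   scorer(output, sample) -> float score in [0, 1]
Model = Callable[[dict, str], str]
Scorer = Callable[[str, dict], float]


def per_task_scores(
    model: Model, samples: Sequence[dict], scorers: Dict[str, Scorer]
) -> Dict[str, float]:
    """Average score per task, reusing the same visual inputs for every task."""
    return {
        task: mean(score(model(sample, task), sample) for sample in samples)
        for task, score in scorers.items()
    }


def task_shift_gap(scores: Dict[str, float], seen_task: str) -> float:
    """Score on the possibly contaminated task minus the mean over the others.

    Illustrative statistic only: a large positive gap suggests performance
    tied to one task format rather than to understanding of the image.
    """
    others = [s for task, s in scores.items() if task != seen_task]
    if seen_task not in scores or not others:
        raise ValueError("need the seen task plus at least one perturbed task")
    return scores[seen_task] - mean(others)
```

For example, a model that answers correctly only in the QA format would score 1.0 on QA, 0.0 on the perturbed tasks, and show a gap of 1.0; a model whose scores are similar across tasks would show a gap near zero.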
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Artificial Intelligence in Healthcare and Education
