Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models
Rui Cai, Bangzheng Li, Xiaofei Wen, Muhao Chen, Zhe Zhao

TL;DR
This paper identifies the issue of modality interference in multimodal large language models and proposes a unified finetuning approach with data augmentation and regularization to improve robustness and generalization across various benchmarks.
Contribution
It introduces a novel finetuning framework combining heuristic and adversarial perturbations with output consistency regularization to mitigate modality interference.
Findings
Enhanced robustness in unimodal tasks
Improved generalization across benchmarks
Maintained or improved multimodal performance
Abstract
Multimodal Large Language Models demonstrate strong performance on multimodal benchmarks, yet often exhibit poor robustness when exposed to spurious modality interference, such as irrelevant text in vision understanding, or irrelevant visual content in question answering. At its core, modality interference refers to cases where spurious signals from non-essential modalities distort model decisions, which we systematically analyze through causal, perturbation-based diagnostic experiments. To address this problem, we propose a unified finetuning framework that combines heuristic and adversarial perturbation-based data augmentation with output-level consistency regularization between original and perturbed inputs. Extensive experiments across image-heavy, text-heavy, and multimodal benchmarks, spanning multiple MLLM architectures and model scales, demonstrate consistent improvements in…
Peer Reviews
Decision·Submitted to ICLR 2026
Innovative Concept: The paper introduces the notion of the Cross-Modality Competency Problem, providing a fresh perspective on modality interference in multimodal large language models. This innovative approach contributes new insights to the field. Systematic Analysis: By designing a perturbation-based causal diagnostic experiment, the authors quantify the impact of modality interference, providing empirical evidence that enhances the scientific rigor and validity of the research. Effective S
Potential Overfitting Risks: The use of perturbation-based data augmentation may introduce noise into the training process. While it aims to enhance robustness, there is a risk that the model might overfit to these perturbed examples, resulting in poorer generalization on clean, real-world data. Lack of Comparative Baselines: The paper does not provide a comprehensive comparison against a wider variety of existing methods or models that address modality interference. Without robust baseline co
1. The definition of Modality Interference is well defined and the findings that model performance goes down due to sub-optimal integration information across modalities is interesting. 2. The motivation behind the proposed losses are well defined. 3. The paper is well written and easy to follow.
1. The proposed losses are not very effective : As shown in the ablations in Table 2, FFT with VQA/AUG performs better than proposed losses. Examples being : LLaVA-1.5-13B - FFT with $D^{AUG}$ - on ScienceQA-IMG | Qwen2.5-VL-3B - FFT with $D^{VQA}$ - on MM-Bench-EN. 2. Consistency of results : In Table 1, the drop in performance of models on Caltech 101 is quite high, for example LLaVA-1.5-7B, goes from (97.0 --> 57.4), but in Table 4 : the drop is much less on OCR images (97.0 --> 92.8) ; thi
- The paper demonstrates that MLLMs are not robust under modality interference where different modalities are not aligned and only one modality is relevant to the task. This highlights an important robustness issue in MLLMs. - The paper shows that using modality misaligned data for fine-tuning can mitigate modality interference and is effective in boosting both unimodal and multimodal reasoning abilities. - Experiments show that the proposed method is effective with different model families i
- The technical contributions of the paper are limited. The proposed perturbation-based data augmentations are not novel, and it is expected to see performance improvements when incorporating data with modality interference. - Choice of the datasets: Why do the authors choose Mini-ImageNet and Caltech-101 as Image-heavy datasets and Open-BookQA and MMLU as text-heavy datasets? Would different choices of these datasets affect the performance of fine-tuned models on VQA tasks? - Concern on the
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling
