Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models

Siqi Lu; Wanying Xu; Yongbin Zheng; Wenting Luan; Peng Sun; Jianhang Yao

arXiv:2602.22644·cs.CV·February 27, 2026


Open Access 3 Reviews

TL;DR

This paper introduces a low-cost, plug-and-play module that dynamically balances modality contributions in multimodal models using frequency domain analysis, significantly improving robustness to missing modalities.

Contribution

The work presents a novel Frequency Ratio Metric and a Multimodal Weight Allocation Module that enhance multimodal learning robustness and can be integrated into various architectures.

Findings

01. MWAM improves performance across diverse tasks and modalities.

02. The method enhances robustness to missing modalities.

03. It also improves state-of-the-art models that are designed to handle missing modalities.

Abstract

Missing modalities present a fundamental challenge in multimodal models, often causing catastrophic performance degradation. Our observations suggest that this fragility stems from an imbalanced learning process, where the model develops an implicit preference for certain modalities, leading to the under-optimization of others. We propose a simple yet efficient method to address this challenge. The central insight of our work is that the dominance relationship between modalities can be effectively discerned and quantified in the frequency domain. To leverage this principle, we first introduce a Frequency Ratio Metric (FRM) to quantify modality preference by analyzing features in the frequency domain. Guided by FRM, we then propose a Multimodal Weight Allocation Module, a plug-and-play component that dynamically re-balances the contribution of each branch during training, promoting a…
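The abstract does not give the FRM formula or the re-weighting rule, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical stand-in score (the share of a feature map's spectral energy above a radial frequency cutoff) and a hypothetical softmax-based rule that gives smaller training weights to branches with larger scores. The function names, the `cutoff` and `temperature` parameters, and the choice to treat a larger ratio as stronger preference are all assumptions made for illustration.

```python
import numpy as np


def high_freq_energy_ratio(feature_map, cutoff=0.25):
    """Share of spectral energy above a radial frequency cutoff.

    Hypothetical stand-in for a frequency-domain score; NOT the paper's FRM.
    feature_map: 2D array (H, W); cutoff: cycles per sample, in (0, 0.5].
    """
    f = np.asarray(feature_map, dtype=np.float64)
    energy = np.abs(np.fft.fft2(f)) ** 2
    # Per-axis frequencies in cycles per sample, broadcast to a 2D radius grid.
    fy = np.fft.fftfreq(f.shape[0])[:, None]
    fx = np.fft.fftfreq(f.shape[1])[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    total = energy.sum()
    if total == 0.0:
        return 0.0  # an all-zero map carries no energy at any frequency
    return float(energy[radius > cutoff].sum() / total)


def rebalancing_weights(scores, temperature=1.0):
    """Map per-modality scores to training weights with mean 1.

    Assumed rule (not from the paper): softmax over negated scores, so the
    branch with the larger score receives the smaller weight.
    """
    s = np.asarray(scores, dtype=np.float64)
    logits = -s / temperature
    logits = logits - logits.max()  # shift for numerical stability
    w = np.exp(logits)
    w = w / w.sum()
    return w * s.size  # rescale so that equal scores yield weights of 1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two made-up "modality" feature maps: one smooth, one noisy.
    smooth = np.outer(np.hanning(32), np.hanning(32))
    noisy = rng.standard_normal((32, 32))
    scores = [high_freq_energy_ratio(smooth), high_freq_energy_ratio(noisy)]
    print("scores:", scores)
    print("weights:", rebalancing_weights(scores))
```

In a training loop, weights like these would typically multiply each branch's loss term and be recomputed periodically from the current features; whether MWAM works this way is not stated in the text above.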

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The idea of diagnosing and correcting multimodal imbalance in the frequency domain is both intuitive and underexplored. The FRM formulation provides a new angle that complements existing spatial-domain balancing techniques.
- MWAM is architecture-agnostic, parameter-light, and easy to integrate into existing backbones. This makes it particularly attractive for practitioners working on robustness in multimodal models.
- The authors validate their method on diverse datasets and tasks (segmentatio

Weaknesses

- While the frequency-domain motivation is intuitive, the paper lacks a rigorous theoretical connection between FRM and gradient dynamics. A more formal justification for why FRM effectively measures modality dominance would strengthen the contribution.
- All experiments are on moderate-sized datasets. It remains unclear whether FRM and MWAM scale effectively to modern multimodal foundation models.
- Some figures and equations are dense and could be better formatted for readability.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

This paper analyzes the modality imbalance problem in the frequency domain and introduces FRM to quantify how strongly the model prefers each modality, which seems promising. Based on FRM, the paper introduces MWAM to modulate the training process of multimodal models so that performance can be balanced and enhanced. The writing is easy to understand.

Weaknesses

The paper states: "The overall training process becomes more stable, as evidenced by the reduced variance in the total loss curve". However, in Figure 4 the total loss of SF-MD (w/o Intervention) is more stable than that of SF-MD (w/ Loss Intervention).

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1) Using frequency domain analysis (FRM) to diagnose and mitigate modality bias sounds interesting, which goes beyond spatial domain balancing.
2) The proposed MWAM is a lightweight and plug-and-play module with negligible computational overhead and no additional parameters during inference. This makes it attractive for real-world deployment.
3) The method is validated on multiple tasks (classification, segmentation, detection) and datasets, showing improvements over baseline methods. Extensiv

Weaknesses

1. The approach is tailored to image-based modalities and frequency domain analysis. It is unclear how well MWAM would generalize to other multimodal settings (e.g., audio-text, video-language) where frequency analysis may not be directly applicable.
2. MWAM introduces several hyperparameters that require careful tuning for different tasks and batch sizes. While sensitivity analysis is provided and the authors claim the performance is not that sensitive to such hyperparameters, considering the


Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis