Quantifying Multimodal Imbalance: A GMM-Guided Adaptive Loss for Audio-Visual Learning
Zhaocheng Liu, Zhiwen Yu, Xiaoqing Liu

TL;DR
This paper introduces a GMM-guided adaptive loss framework for audio-visual learning that addresses modality imbalance dynamically at the sample level, improving performance and filtering out noisy, low-quality samples.
Contribution
The paper proposes a GMM-based method that diagnoses and mitigates modality imbalance dynamically at the sample level, outperforming existing static approaches in multimodal learning.
Findings
Significantly outperforms state-of-the-art baselines on multiple datasets.
Effectively identifies and filters noisy samples to enhance learning.
Demonstrates improved fusion and alignment through adaptive loss.
Abstract
Multimodal learning integrates diverse modalities but suffers from modality imbalance, where dominant modalities suppress weaker ones due to inconsistent convergence rates. Existing methods predominantly rely on static modulation or heuristics, overlooking sample-level distributional variations in prediction bias. Specifically, they fail to distinguish outlier samples where the modality gap is exacerbated by low data quality. We propose a framework to quantitatively diagnose and dynamically mitigate this imbalance at the sample level. We introduce the Modality Gap metric to quantify prediction discrepancies. Analysis reveals that this gap follows a bimodal distribution, indicating the coexistence of balanced and imbalanced sample subgroups. We employ a Gaussian Mixture Model (GMM) to explicitly model this distribution, leveraging Bayesian posterior probabilities for soft subgroup…
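The GMM step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Modality Gap definition is not reproduced on this page, so the gap values below are synthetic stand-ins drawn from two Gaussians, and the function names `posterior` and `fit_gmm_1d`, the quartile initialisation, and the iteration count are all assumptions made for illustration. The sketch fits a two-component one-dimensional GMM with plain EM and returns the Bayesian posterior probability that a sample belongs to each subgroup.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posterior(x, weights, means, variances):
    """Posterior P(component k | x) via Bayes' rule, for each component."""
    joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    if total == 0.0:
        # Both densities underflowed; fall back to an uninformative split.
        return [1.0 / len(joint)] * len(joint)
    return [j / total for j in joint]


def fit_gmm_1d(xs, n_iter=100, var_floor=1e-6):
    """Fit a two-component 1-D GMM with EM (hypothetical helper).

    Returns (weights, means, variances), each a list of length 2.
    """
    n = len(xs)
    xs_sorted = sorted(xs)
    # Assumed initialisation: means at the lower and upper quartiles.
    means = [xs_sorted[n // 4], xs_sorted[(3 * n) // 4]]
    overall_mean = sum(xs) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in xs) / n, var_floor)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: soft assignment of every sample to both components.
        resp = [posterior(x, weights, means, variances) for x in xs]
        # M-step: re-estimate each component from its weighted samples.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk,
                var_floor,
            )
    return weights, means, variances


# Synthetic stand-in for per-sample gap values (NOT data from the paper):
# one subgroup centred near zero, one centred on a larger gap.
random.seed(0)
gaps = [random.gauss(0.0, 0.05) for _ in range(300)]
gaps += [random.gauss(0.6, 0.1) for _ in range(200)]

weights, means, variances = fit_gmm_1d(gaps)
print("component means:", means)
print("posterior for a gap of 0.55:", posterior(0.55, weights, means, variances))
```

Soft posteriors, rather than a hard threshold on the gap, let a sample near the boundary between the two subgroups contribute partially to each. How the paper turns these posteriors into loss weights is not covered by the truncated abstract, so the sketch stops at the posterior.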
