Boosting Multimodal Learning via Disentangled Gradient Learning

Shicai Wei; Chunbo Luo; Yang Luo

arXiv:2507.10213·cs.CV·July 15, 2025

Open Access

TL;DR

This paper identifies the optimization conflict in multimodal learning caused by cross-modal fusion and introduces a disentangled gradient learning framework to improve the training of multimodal models.

Contribution

The paper reveals gradient interference between the modality encoders and the fusion module in multimodal models, and proposes DGL to decouple their optimization.

Findings

01. DGL improves performance across multiple modalities and tasks.

02. It effectively eliminates gradient interference in multimodal training.

03. Experimental results demonstrate DGL's versatility and effectiveness.

Abstract

Multimodal learning often suffers from under-optimization and can perform worse than unimodal learning. Existing methods attribute this problem to imbalanced learning between modalities and rebalance them through gradient modulation. However, they fail to explain why the dominant modality in a multimodal model also underperforms its unimodal counterpart. In this work, we reveal an optimization conflict between the modality encoder and the modality fusion module in multimodal models. Specifically, we prove that cross-modal fusion decreases the gradient passed back to each modality encoder compared with unimodal models. Consequently, each modality performs worse in the multimodal model than in the unimodal model. To address this, we propose a disentangled gradient learning (DGL) framework to decouple the optimization of the modality…
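The claim that fusion shrinks the gradient reaching each encoder can be illustrated with a toy calculation. This is only a sketch under assumptions that do not come from the abstract: two modalities that each emit a scalar logit, late fusion by summing the logits, and a binary cross-entropy loss. It is not the paper's proof, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def grad_wrt_logit(logit, label):
    """Gradient of binary cross-entropy w.r.t. the logit: sigmoid(z) - y."""
    return sigmoid(logit) - label


def encoder_gradients(z_a, z_b, label):
    """Compare the gradient each encoder output receives with and without fusion.

    Assumption (not from the paper): fusion is a plain sum of the two
    logits, so both encoder outputs receive the same upstream gradient.
    """
    return {
        "unimodal_a": grad_wrt_logit(z_a, label),
        "unimodal_b": grad_wrt_logit(z_b, label),
        "fused": grad_wrt_logit(z_a + z_b, label),
    }


# Both modalities already lean toward the correct class (label = 1).
grads = encoder_gradients(z_a=1.0, z_b=0.5, label=1.0)
for name, value in grads.items():
    print(f"{name}: {value:+.4f}")
```

In this setting the fused logit is more confident than either unimodal logit, so the loss gradient flowing back to each encoder is smaller in magnitude than the gradient that encoder would receive if trained alone. Other fusion operators or loss functions may behave differently.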

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Face and Expression Recognition · Speech Recognition and Synthesis