What Makes Training Multi-Modal Classification Networks Hard?
Weiyao Wang, Du Tran, Matt Feiszli

TL;DR
This paper investigates why jointly trained multi-modal networks underperform the best single-modal network, identifies two causes (overfitting from increased capacity, and modalities that overfit and generalize at different rates), and proposes Gradient Blending to improve training.
Contribution
The paper introduces Gradient Blending, a method that improves multi-modal training by blending modalities according to their overfitting behavior, addressing both overfitting and the different rates at which modalities overfit and generalize.
Findings
Gradient Blending outperforms standard baselines.
Achieves state-of-the-art accuracy on multiple benchmarks.
Addresses overfitting in multi-modal networks effectively.
Abstract
Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart. In our experiments, however, we observe the opposite: the best single-modal network always outperforms the multi-modal network. This observation is consistent across different combinations of modalities and on different tasks and benchmarks. This paper identifies two main causes for this performance drop: first, multi-modal networks are often prone to overfitting due to increased capacity. Second, different modalities overfit and generalize at different rates, so training them jointly with a single optimization strategy is sub-optimal. We address these two problems with a technique we call Gradient Blending, which computes an optimal blend of modalities based…
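To make the idea of blending modalities by their overfitting behavior concrete, here is a minimal sketch. The abstract above is truncated before it states the weighting rule, so the rule used here is an assumption: each branch's loss weight is taken proportional to G / O², where G is the drop in validation loss over a short training window and O is the growth of the validation-minus-train gap over the same window. All names (`LossCheckpoint`, `blend_weights`, `blended_loss`), the clamping of negative gains to zero, and the uniform fallback are hypothetical choices made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class LossCheckpoint:
    """Train and validation loss for one branch at one checkpoint."""
    train: float
    val: float


def blend_weights(before, after, eps=1e-8):
    """Estimate normalized per-branch loss weights.

    before / after: dicts mapping a branch name (e.g. "rgb", "audio",
    "fused") to its LossCheckpoint at the start and end of a training window.
    """
    raw = {}
    for name in before:
        # Generalization gain G: how much the validation loss dropped.
        gain = before[name].val - after[name].val
        # Overfitting increase O: growth of the validation-minus-train gap.
        overfit = (after[name].val - after[name].train) - (
            before[name].val - before[name].train
        )
        # Assumed rule: weight proportional to G / O^2.
        # Negative gains are clamped to zero (an illustrative choice).
        raw[name] = max(gain, 0.0) / (overfit ** 2 + eps)

    total = sum(raw.values())
    if total == 0.0:
        # No branch improved on validation: fall back to uniform weights.
        return {name: 1.0 / len(raw) for name in raw}
    return {name: value / total for name, value in raw.items()}


def blended_loss(losses, weights):
    """Weighted sum of the per-branch losses for one training step."""
    return sum(weights[name] * losses[name] for name in losses)
```

Under this sketch, a branch that improves validation loss a lot while its train/validation gap grows little receives a large weight, and a branch that mostly overfits is down-weighted. The weights would be re-estimated from held-out data, not from the training loss alone.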
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications
