Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks
Nan Wu, Stanisław Jastrzębski, Kyunghyun Cho, Krzysztof J. Geras

TL;DR
This paper identifies a greedy learning behavior in multi-modal neural networks where models overly depend on one modality, and proposes a training algorithm to balance learning across modalities, improving generalization.
Contribution
It introduces the conditional utilization rate, a measure of how much a model relies on each modality, and a training-time proxy for it called the conditional learning speed, together with an algorithm that balances learning speeds across modalities.
Findings
Balanced learning speeds improve model generalization.
The proposed algorithm improves performance across multiple datasets.
Conditional utilization rate correlates with model dependence on modalities.
Abstract
We hypothesize that due to the greedy nature of learning in multi-modal deep neural networks, these models tend to rely on just one modality while under-fitting the other modalities. Such behavior is counter-intuitive and hurts the models' generalization, as we observe empirically. To estimate the model's dependence on each modality, we compute the gain on the accuracy when the model has access to it in addition to another modality. We refer to this gain as the conditional utilization rate. In the experiments, we consistently observe an imbalance in conditional utilization rates between modalities, across multiple tasks and architectures. Since conditional utilization rate cannot be computed efficiently during training, we introduce a proxy for it based on the pace at which the model learns from each modality, which we refer to as the conditional learning speed. We propose an algorithm…
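The abstract describes the conditional utilization rate only as the accuracy gain a model gets from one modality when it already has access to another. The following is a minimal Python sketch of that idea, not the authors' implementation. It assumes the gain is a plain difference of accuracies (the paper may normalize it differently), and every name in it (`accuracy`, `conditional_utilization_gain`, the toy label and prediction lists) is hypothetical and chosen for illustration.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one example is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def conditional_utilization_gain(
    y_true: Sequence[int],
    pred_both: Sequence[int],
    pred_without: Sequence[int],
) -> float:
    """Accuracy gained by adding one modality on top of the other.

    Assumption: the gain is an unnormalized accuracy difference.
    pred_both    -- predictions made with both modalities available
    pred_without -- predictions made without the modality being measured
    """
    return accuracy(y_true, pred_both) - accuracy(y_true, pred_without)


# Toy, hand-made data (not results from the paper).
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
pred_both = [0, 1, 1, 0, 1, 0, 1, 0]      # both modalities
pred_m0_only = [0, 1, 0, 0, 1, 1, 1, 0]   # modality m0 only
pred_m1_only = [0, 1, 1, 0, 1, 0, 0, 0]   # modality m1 only

gain_m1_given_m0 = conditional_utilization_gain(y_true, pred_both, pred_m0_only)
gain_m0_given_m1 = conditional_utilization_gain(y_true, pred_both, pred_m1_only)

# A large gap between the two gains indicates reliance on one modality.
imbalance = gain_m1_given_m0 - gain_m0_given_m1
print(gain_m1_given_m0, gain_m0_given_m1, imbalance)
```

In this toy case, adding m1 on top of m0 helps more than adding m0 on top of m1, so the hypothetical model leans more heavily on m1. Computing these gains requires predictions with each modality withheld, which is consistent with the abstract's remark that the rate cannot be computed efficiently during training.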
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems
