TL;DR
This paper introduces MMTM, a simple yet effective neural module for multimodal feature fusion in CNNs, improving recognition accuracy across various multimodal tasks with minimal architectural changes.
Contribution
The paper proposes the Multimodal Transfer Module (MMTM), which enables slow, flexible fusion of multiple modalities at different levels of the CNN feature hierarchy and integrates easily with pretrained models.
Findings
Improves recognition accuracy on multiple datasets.
Achieves state-of-the-art or competitive results.
Facilitates multimodal fusion with minimal architectural modifications.
Abstract
In late fusion, each modality is processed in a separate unimodal Convolutional Neural Network (CNN) stream and the scores of each modality are fused at the end. Due to its simplicity, late fusion is still the predominant approach in many state-of-the-art multimodal applications. In this paper, we present a simple neural network module for leveraging the knowledge from multiple modalities in convolutional neural networks. The proposed unit, named Multimodal Transfer Module (MMTM), can be added at different levels of the feature hierarchy, enabling slow modality fusion. Using squeeze and excitation operations, MMTM utilizes the knowledge of multiple modalities to recalibrate the channel-wise features in each CNN stream. Unlike other intermediate fusion methods, the proposed module can be used for feature modality fusion in convolution layers with different spatial dimensions. Another…
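The squeeze-and-excitation recalibration described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the class name `MMTMSketch`, the `reduction` ratio, the ReLU on the joint vector, the plain sigmoid gates, and the random weights standing in for learned parameters are all assumptions made for the example. It shows why differing spatial dimensions are not a problem: each stream is first pooled to a per-channel vector, so only channel counts matter.

```python
import numpy as np


def squeeze(x):
    """Global-average-pool a (channels, *spatial) map to a (channels,) vector."""
    return x.reshape(x.shape[0], -1).mean(axis=1)


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


class MMTMSketch:
    """Hypothetical two-stream fusion unit; weights are random, not learned."""

    def __init__(self, channels_a, channels_b, reduction=4, seed=0):
        rng = np.random.default_rng(seed)
        # Assumed bottleneck size for the joint representation.
        joint_dim = max(1, (channels_a + channels_b) // reduction)
        self.w_joint = rng.normal(0, 0.1, size=(joint_dim, channels_a + channels_b))
        self.b_joint = np.zeros(joint_dim)
        self.w_a = rng.normal(0, 0.1, size=(channels_a, joint_dim))
        self.b_a = np.zeros(channels_a)
        self.w_b = rng.normal(0, 0.1, size=(channels_b, joint_dim))
        self.b_b = np.zeros(channels_b)

    def __call__(self, feat_a, feat_b):
        # Squeeze: one descriptor per channel, independent of spatial shape.
        s = np.concatenate([squeeze(feat_a), squeeze(feat_b)])
        # Joint representation that mixes both modalities (ReLU is an assumption).
        z = np.maximum(self.w_joint @ s + self.b_joint, 0.0)
        # Excitation: one gate per channel for each stream.
        gate_a = sigmoid(self.w_a @ z + self.b_a)
        gate_b = sigmoid(self.w_b @ z + self.b_b)
        # Broadcast each gate over that stream's spatial dimensions.
        shape_a = (-1,) + (1,) * (feat_a.ndim - 1)
        shape_b = (-1,) + (1,) * (feat_b.ndim - 1)
        return feat_a * gate_a.reshape(shape_a), feat_b * gate_b.reshape(shape_b)


rng = np.random.default_rng(1)
feat_a = rng.normal(size=(8, 16, 16))     # e.g. a 2D stream: 8 channels, 16x16
feat_b = rng.normal(size=(12, 4, 7, 7))   # e.g. a 3D stream: 12 channels, 4x7x7
out_a, out_b = MMTMSketch(8, 12)(feat_a, feat_b)
print(out_a.shape, out_b.shape)
```

Because the unit only rescales channels and leaves each stream's shape unchanged, a module of this kind can in principle be placed between existing layers of two pretrained unimodal networks without altering the layers themselves.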
Videos
MMTM: Multimodal Transfer Module for CNN Fusion· youtube
Taxonomy
Methods: Convolution
