Adaptive Fusion Techniques for Multimodal Data
Gaurav Sahu, Olga Vechtomova

TL;DR
This paper introduces adaptive fusion methods that let a neural network learn how to combine heterogeneous modalities, rather than relying on a fixed operation, improving performance on multimodal machine translation and emotion recognition.
Contribution
It proposes two novel adaptive fusion networks, Auto-Fusion and GAN-Fusion, which dynamically learn to combine multimodal features without relying on fixed deterministic operations.
Findings
Outperforms existing fusion methods in multimodal translation and emotion recognition.
Uses lightweight networks to effectively model context from multiple modalities.
Demonstrates better performance than transformer-based approaches.
Abstract
Effective fusion of data from multiple modalities, such as video, speech, and text, is challenging due to the heterogeneous nature of multimodal data. In this paper, we propose adaptive fusion techniques that aim to model context from different modalities effectively. Instead of defining a deterministic fusion operation, such as concatenation, for the network, we let the network decide "how" to combine a given set of multimodal features more effectively. We propose two networks: 1) Auto-Fusion, which learns to compress information from different modalities while preserving the context, and 2) GAN-Fusion, which regularizes the learned latent space given context from complementing modalities. A quantitative evaluation on the tasks of multimodal machine translation and emotion recognition suggests that our lightweight, adaptive networks can better model context from other modalities than…
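The Auto-Fusion idea described in the abstract, compressing features from several modalities while preserving context, can be illustrated with a small autoencoder-style forward pass. This is a minimal sketch under stated assumptions, not the authors' implementation: the reconstruction objective, the tanh nonlinearity, the feature dimensions, and every function name below are hypothetical choices made for illustration.

```python
import math
import random


def linear(weights, bias, x):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_linear(n_out, n_in, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def auto_fusion_forward(modalities, encoder, decoder):
    """Concatenate modality features, compress them, and reconstruct the input.

    Returns the fused (compressed) vector and the mean squared
    reconstruction error. Assumption: a low reconstruction error is used
    as a proxy for "preserving the context" of the original features.
    """
    concat = [v for feats in modalities for v in feats]
    fused = [math.tanh(h) for h in linear(*encoder, concat)]
    recon = linear(*decoder, fused)
    loss = sum((r - c) ** 2 for r, c in zip(recon, concat)) / len(concat)
    return fused, loss


rng = random.Random(0)

# Hypothetical per-modality feature vectors (dimensions chosen arbitrarily).
text_feats = [rng.gauss(0, 1) for _ in range(6)]
speech_feats = [rng.gauss(0, 1) for _ in range(4)]
video_feats = [rng.gauss(0, 1) for _ in range(5)]

total_dim = len(text_feats) + len(speech_feats) + len(video_feats)
fused_dim = 8  # smaller than total_dim, so the fusion layer must compress

encoder = init_linear(fused_dim, total_dim, rng)
decoder = init_linear(total_dim, fused_dim, rng)

fused, recon_loss = auto_fusion_forward(
    [text_feats, speech_feats, video_feats], encoder, decoder
)
```

In a trained system the encoder and decoder weights would be learned jointly with the downstream task, and `fused` would replace the plain concatenation as the input to the task network. The sketch covers only a single untrained forward pass.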
