Memory based fusion for multi-modal deep learning
Darshana Priyasad, Tharindu Fernando, Simon Denman, Sridha Sridharan, Clinton Fookes

TL;DR
This paper introduces a Memory based Attentive Fusion layer for multi-modal deep learning that captures long-term dependencies and improves fusion performance over naive methods.
Contribution
The paper proposes a novel fusion layer incorporating explicit memory and attention mechanisms to better model long-term dependencies in multi-modal data.
Findings
Enhanced performance on multiple datasets
Generalizes across different modalities and networks
Outperforms naive fusion methods
Abstract
The use of multi-modal data for deep machine learning has shown promise when compared to uni-modal approaches, with fusion of multi-modal features resulting in improved performance in several applications. However, most state-of-the-art methods use naive fusion, which processes feature streams independently and ignores possible long-term dependencies within the data during fusion. In this paper, we present a novel Memory based Attentive Fusion layer, which fuses modes by incorporating both the current features and long-term dependencies in the data, thus allowing the model to understand the relative importance of modes over time. We introduce an explicit memory block within the fusion layer which stores features containing long-term dependencies of the fused data. The feature inputs from uni-modal encoders are fused through attentive composition and transformation followed by naive fusion of…
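The idea of fusing current features with an explicit memory can be illustrated with a small sketch. This is not the authors' layer: the class name `MemoryFusionSketch`, the dot-product attention, the interpolating write rule, and the random initialisation are all assumptions chosen for illustration, and the learned transformations mentioned in the abstract are omitted. The sketch concatenates two uni-modal feature vectors (naive fusion), reads from a slot memory with softmax attention, concatenates the read vector onto the fused features, and then writes the current features back into memory so that later steps can draw on them.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


class MemoryFusionSketch:
    """Hypothetical memory-augmented fusion step (illustrative only)."""

    def __init__(self, num_slots, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Assumption: small random initial memory, one vector per slot.
        self.memory = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(num_slots)
        ]

    def step(self, feats_a, feats_b):
        # Naive fusion: concatenate the two uni-modal feature vectors.
        current = list(feats_a) + list(feats_b)
        if len(current) != self.dim:
            raise ValueError("concatenated features must match memory dim")

        # Read: attend over memory slots, using the fused features as query.
        weights = softmax([dot(slot, current) for slot in self.memory])
        read = [
            sum(w * slot[j] for w, slot in zip(weights, self.memory))
            for j in range(self.dim)
        ]

        # Output: current features followed by the memory read vector.
        fused = current + read

        # Write (assumed rule): move each slot toward the current features
        # in proportion to its attention weight.
        for w, slot in zip(weights, self.memory):
            for j in range(self.dim):
                slot[j] = (1.0 - w) * slot[j] + w * current[j]

        return fused, weights


layer = MemoryFusionSketch(num_slots=4, dim=5)
fused, weights = layer.step([1.0, 0.0], [0.5, 0.5, 0.5])
print(len(fused), round(sum(weights), 6))
```

Because the memory persists across calls to `step`, the read vector at a given time step depends on earlier inputs, which is the property that plain concatenation lacks. A trained layer would replace the fixed dot-product and interpolation rules with learned parameters.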
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Music and Audio Processing · Video Analysis and Summarization
