Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations
Dingkang Yang, Mingcheng Li, Linhao Qu, Kun Yang, Peng Zhai, Song Wang, Lihua Zhang

TL;DR
This paper introduces a novel multimodal fusion method called MEA that learns modality-exclusive and -agnostic representations to effectively handle asynchronous sequences and heterogeneity in video-based human intention understanding.
Contribution
The paper proposes a new fusion framework with self-attention, cross-modal attention, and adversarial strategies to improve multimodal sequence integration and robustness.
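To make the cross-modal attention idea concrete, here is a minimal NumPy sketch of generic scaled dot-product attention between two sequences of different lengths. It is not the authors' MEA implementation: the function name `cross_modal_attention`, the projection matrices, and all dimensions are assumptions chosen for illustration. It only shows why attention can relate asynchronous streams without a prior word-to-frame alignment.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(target, source, w_q, w_k, w_v):
    """Hypothetical helper: the target modality queries the source modality.

    target: (T_t, d_t) sequence, e.g. word-level text features
    source: (T_s, d_s) sequence, e.g. frame-level audio features
    T_t and T_s may differ, so no explicit alignment is required.
    """
    q = target @ w_q                          # (T_t, d)
    k = source @ w_k                          # (T_s, d)
    v = source @ w_v                          # (T_s, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T_t, T_s)
    weights = softmax(scores, axis=-1)        # each target step attends over all source steps
    return weights @ v, weights

# Illustrative shapes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
text = rng.normal(size=(20, 32))    # 20 text steps, 32-dim features
audio = rng.normal(size=(50, 16))   # 50 audio steps, 16-dim features
d = 8
w_q = rng.normal(size=(32, d))
w_k = rng.normal(size=(16, d))
w_v = rng.normal(size=(16, d))

fused, weights = cross_modal_attention(text, audio, w_q, w_k, w_v)
```

Here `fused` has one row per text step, each a weighted mixture of audio features, so the two streams never need to share a sampling rate. In a trained model the projection matrices would be learned rather than random.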
Findings
Outperforms existing methods on three multimodal datasets.
Effectively captures context dynamics within modalities.
Enhances knowledge exchange across heterogeneous modalities.
Abstract
Understanding human intentions (e.g., emotions) from videos has received considerable attention recently. Video streams generally constitute a blend of temporal data stemming from distinct modalities, including natural language, facial expressions, and auditory clues. Despite the impressive advancements of previous works via attention-based paradigms, the inherent temporal asynchrony and modality heterogeneity challenges remain in multimodal sequence fusion, causing adverse performance bottlenecks. To tackle these issues, we propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations (MEA) to refine multimodal features and leverage the complementarity across distinct modalities. On the one hand, MEA introduces a predictive self-attention module to capture reliable context dynamics within modalities and reinforce unique features over the…
Taxonomy
Topics: Image Retrieval and Classification Techniques · Face and Expression Recognition · Speech and Audio Processing
Methods: Softmax · Attention Is All You Need
