Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations

Dingkang Yang; Mingcheng Li; Linhao Qu; Kun Yang; Peng Zhai; Song Wang; Lihua Zhang

arXiv:2407.04955 · cs.CV · October 1, 2024

TL;DR

This paper introduces a novel multimodal fusion method called MEA that learns modality-exclusive and -agnostic representations to effectively handle asynchronous sequences and heterogeneity in video-based human intention understanding.

Contribution

The paper proposes a new fusion framework with self-attention, cross-modal attention, and adversarial strategies to improve multimodal sequence integration and robustness.
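As an illustration of the cross-modal attention idea mentioned above, the following is a minimal sketch of plain scaled dot-product attention in which one modality queries another. It is not the authors' MEA implementation: it omits learned projections, the predictive self-attention module, and the adversarial component, and the names `cross_modal_attention`, `text`, and `audio` as well as the toy feature values are assumptions made for illustration. It shows only why attention suits asynchronous inputs: the two sequences may have different lengths, and no frame-level alignment is needed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention from one modality to another.

    queries: T_q vectors of dimension d (target modality)
    keys:    T_k vectors of dimension d (source modality)
    values:  T_k vectors of dimension d_v (source modality)
    Returns T_q vectors of dimension d_v. T_q and T_k may differ,
    so the two sequences do not have to be temporally aligned.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query step to every source step.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the source values.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        outputs.append(out)
    return outputs


# Hypothetical toy features: 3 text steps and 5 audio steps, both of dimension 2.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.2, 0.8], [0.9, 0.1]]

# Text queries attend over the audio sequence; the output has one vector per text step.
fused = cross_modal_attention(text, audio, audio)
print(len(fused), len(fused[0]))
```

In a real model the queries, keys, and values would typically pass through learned linear projections and multiple heads before this step; the sketch leaves those out to keep the length mismatch between modalities in focus.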

Findings

1. Outperforms existing methods on three multimodal datasets.

2. Effectively captures context dynamics within modalities.

3. Enhances knowledge exchange across heterogeneous modalities.

Abstract

Understanding human intentions (e.g., emotions) from videos has received considerable attention recently. Video streams generally constitute a blend of temporal data stemming from distinct modalities, including natural language, facial expressions, and auditory clues. Despite the impressive advancements of previous works via attention-based paradigms, the inherent temporal asynchrony and modality heterogeneity challenges remain in multimodal sequence fusion, causing adverse performance bottlenecks. To tackle these issues, we propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations (MEA) to refine multimodal features and leverage the complementarity across distinct modalities. On the one hand, MEA introduces a predictive self-attention module to capture reliable context dynamics within modalities and reinforce unique features over the…

Taxonomy

Topics: Image Retrieval and Classification Techniques · Face and Expression Recognition · Speech and Audio Processing

Methods: Softmax · Attention Is All You Need