Multimodal Self-Attention Network with Temporal Alignment for Audio-Visual Emotion Recognition
Inyong Koo, Yeeun Seong, Minseok Son, Jaehyuk Jang, Changick Kim

TL;DR
This paper introduces a Transformer-based multimodal self-attention framework with temporal alignment techniques, including TaRoPE and CTM loss, to improve audio-visual emotion recognition by effectively handling frame-rate mismatches.
Contribution
It proposes novel temporal alignment methods, TaRoPE and CTM loss, within a Transformer framework for better multimodal feature synchronization in emotion recognition.
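To make the idea of temporally aligned rotary positions concrete, here is a minimal sketch, not the paper's implementation. It assumes that each token's rotary position is its timestamp in seconds rather than its index, so audio and video tokens sampled at the same instant receive the same rotation. The frame rates (50 and 25 tokens per second), the helper names `timestamps`, `rotate`, and `dot`, and the base of 10000 are illustrative assumptions.

```python
import math


def timestamps(num_tokens, frame_rate_hz):
    """Time in seconds of each token, assuming uniform sampling from t = 0."""
    return [i / frame_rate_hz for i in range(num_tokens)]


def rotate(vec, t, base=10000.0):
    """Apply a rotary embedding to `vec` at continuous position `t`.

    Each consecutive pair (vec[2k], vec[2k+1]) is rotated by the angle
    t * base ** (-2k / d), where d is the (even) vector length.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for k in range(d // 2):
        theta = t * base ** (-2.0 * k / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * k], vec[2 * k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical rates: 50 audio tokens per second, 25 video tokens per second.
audio_t = timestamps(4, 50.0)  # audio token 2 sits at 0.04 s
video_t = timestamps(2, 25.0)  # video token 1 also sits at 0.04 s

q = [1.0, 0.0, 0.5, -0.5]   # toy query vector from an audio token
k = [0.3, 0.7, -0.2, 0.1]   # toy key vector from a video token

# Same timestamp means same rotation, so the attention score equals the
# unrotated dot product even though the token indices (2 and 1) differ.
score_aligned = dot(rotate(q, audio_t[2]), rotate(k, video_t[1]))
```

With index-based positions, audio token 2 and video token 1 would be treated as one step apart despite being simultaneous. With time-based positions, the score depends only on the time difference between the two tokens, which is what lets a single attention layer mix modalities sampled at different rates.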
Findings
Improved accuracy on CREMA-D and RAVDESS datasets.
Effective handling of frame-rate mismatch enhances temporal cue preservation.
Outperforms recent baseline methods in multimodal emotion recognition.
Abstract
Audio-visual emotion recognition (AVER) methods typically fuse utterance-level features, and even frame-level attention models seldom address the frame-rate mismatch across modalities. In this paper, we propose a Transformer-based framework focusing on the temporal alignment of multimodal features. Our design employs a multimodal self-attention encoder that simultaneously captures intra- and inter-modal dependencies within a shared feature space. To address heterogeneous sampling rates, we incorporate Temporally-aligned Rotary Position Embeddings (TaRoPE), which implicitly synchronize audio and video tokens. Furthermore, we introduce a Cross-Temporal Matching (CTM) loss that enforces consistency among temporally proximate pairs, guiding the encoder toward better alignment. Experiments on CREMA-D and RAVDESS datasets demonstrate consistent improvements over recent baselines, suggesting…
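The abstract does not give the form of the CTM loss, so the following is only a hedged illustration of what "consistency among temporally proximate pairs" could look like, not the authors' formulation. It assumes a penalty of one minus cosine similarity, averaged over audio/video token pairs whose timestamps fall within a window. The function names `cosine` and `proximity_loss`, the `window` parameter, and the choice of cosine similarity are all assumptions made for this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def proximity_loss(audio, audio_t, video, video_t, window=0.02):
    """Mean (1 - cosine) over audio/video pairs within `window` seconds.

    `audio` and `video` are lists of feature vectors in a shared space;
    `audio_t` and `video_t` are the matching timestamps in seconds.
    Returns 0.0 when no pair falls inside the window.
    """
    terms = [
        1.0 - cosine(a, v)
        for a, ta in zip(audio, audio_t)
        for v, tv in zip(video, video_t)
        if abs(ta - tv) <= window
    ]
    return sum(terms) / len(terms) if terms else 0.0


# Toy example: only the first audio token is close in time to the video token.
audio = [[1.0, 0.0], [0.0, 1.0]]
audio_t = [0.0, 0.5]
video = [[2.0, 0.0]]
video_t = [0.01]

loss_matched = proximity_loss(audio, audio_t, video, video_t)
loss_mismatched = proximity_loss(audio, audio_t, [[0.0, 1.0]], video_t)
```

In this toy case the loss is low when the temporally close audio and video features point in the same direction and high when they are orthogonal. Pairs outside the window contribute nothing.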
Taxonomy
Topics: Emotion and Mood Recognition · Music and Audio Processing · Speech and Audio Processing
