Multimodal Self-Attention Network with Temporal Alignment for Audio-Visual Emotion Recognition

Inyong Koo; Yeeun Seong; Minseok Son; Jaehyuk Jang; Changick Kim

arXiv:2603.11095 · cs.MM · March 13, 2026

TL;DR

This paper introduces a Transformer-based multimodal self-attention framework with two temporal alignment techniques, Temporally-aligned Rotary Position Embeddings (TaRoPE) and a Cross-Temporal Matching (CTM) loss, to improve audio-visual emotion recognition by handling the frame-rate mismatch between audio and video.

Contribution

The paper proposes two temporal alignment methods, TaRoPE and the CTM loss, within a Transformer framework to better synchronize multimodal features for emotion recognition.

Findings

01. Improved accuracy on the CREMA-D and RAVDESS datasets.

02. Effective handling of frame-rate mismatch enhances temporal cue preservation.

03. Outperforms recent baseline methods in multimodal emotion recognition.

Abstract

Audio-visual emotion recognition (AVER) methods typically fuse utterance-level features, and even frame-level attention models seldom address the frame-rate mismatch across modalities. In this paper, we propose a Transformer-based framework focusing on the temporal alignment of multimodal features. Our design employs a multimodal self-attention encoder that simultaneously captures intra- and inter-modal dependencies within a shared feature space. To address heterogeneous sampling rates, we incorporate Temporally-aligned Rotary Position Embeddings (TaRoPE), which implicitly synchronize audio and video tokens. Furthermore, we introduce a Cross-Temporal Matching (CTM) loss that enforces consistency among temporally proximate pairs, guiding the encoder toward better alignment. Experiments on CREMA-D and RAVDESS datasets demonstrate consistent improvements over recent baselines, suggesting…
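
The idea of synchronizing audio and video tokens through rotary position embeddings can be illustrated with a minimal sketch. The abstract does not give the exact TaRoPE formulation, so everything below is an assumption for illustration: the helper names (`timestamps`, `rotate_by_time`, `dot`), the frame rates (100 and 30 frames per second), and the frequency base of 10000 are all hypothetical. The sketch uses each token's timestamp in seconds, rather than its index, as the rotary position, so tokens from streams with different sampling rates share one time axis.

```python
import math


def timestamps(num_frames, fps):
    """Timestamp in seconds of each frame in a stream sampled at `fps` (hypothetical helper)."""
    return [i / fps for i in range(num_frames)]


def rotate_by_time(vec, t, base=10000.0):
    """Rotary embedding of an even-length vector, using continuous time `t` as the position.

    Assumption: standard RoPE pairing of adjacent dimensions, with the token
    index replaced by a timestamp. This is a sketch, not the paper's definition.
    """
    d = len(vec)
    out = [0.0] * d
    for k in range(d // 2):
        theta = t * base ** (-2.0 * k / d)  # rotation angle for the k-th 2D plane
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * k], vec[2 * k + 1]
        out[2 * k] = x * c - y * s
        out[2 * k + 1] = x * s + y * c
    return out


def dot(a, b):
    """Plain dot product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))


# Assumed frame rates: 100 audio feature frames per second, 30 video frames per second.
audio_t = timestamps(100, 100.0)
video_t = timestamps(30, 30.0)

q_audio = [1.0, 2.0, 3.0, 4.0]    # toy query from an audio token
k_video = [0.5, -1.0, 2.0, 0.25]  # toy key from a video token

# Audio frame 50 and video frame 15 both fall at 0.5 s, so they receive the same rotation.
score_sync = dot(rotate_by_time(q_audio, audio_t[50]),
                 rotate_by_time(k_video, video_t[15]))
```

Because each 2D rotation is orthogonal, the score between a rotated query and a rotated key depends only on the difference of their timestamps. In this sketch, an audio token and a video token that occur at the same moment therefore interact as if no positional offset separated them, regardless of how many frames each stream has produced.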

Taxonomy

Topics: Emotion and Mood Recognition · Music and Audio Processing · Speech and Audio Processing