Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention

Joe Dhanith P R; Shravan Venkatraman; Vigya Sharma; Santhosh Malarvannan

arXiv:2407.18552·cs.MM·January 21, 2026

TL;DR

This paper introduces AVT-CA, a novel transformer-based multimodal emotion recognition model that effectively fuses audio and visual cues using cross attention and hierarchical feature refinement, leading to improved accuracy.

Contribution

The paper presents a hierarchical video feature extraction method combined with cross-attention fusion in a transformer architecture for robust multimodal emotion recognition.

Findings

01. AVT-CA outperforms state-of-the-art methods on benchmark datasets.

02. AVT-CA achieves significant improvements in accuracy and F1-score.

03. Hierarchical attention effectively suppresses irrelevant information.

Abstract

Multimodal emotion recognition (MER) aims to infer human affect by jointly modeling audio and visual cues; however, existing approaches often struggle with temporal misalignment, weakly discriminative feature representations, and suboptimal fusion of heterogeneous modalities. To address these challenges, we propose AVT-CA, an Audio-Video Transformer architecture with cross attention for robust emotion recognition. The proposed model introduces a hierarchical video feature representation that combines channel attention, spatial attention, and local feature extraction to emphasize emotionally salient regions while suppressing irrelevant information. These refined visual features are integrated with audio representations through an intermediate transformer-based fusion mechanism that captures interlinked temporal dependencies across modalities. Furthermore, a cross-attention module…
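The cross-attention idea named in the abstract can be illustrated with a minimal sketch of generic scaled dot-product cross attention, in which each time step of one modality attends over all time steps of the other. This is not the authors' implementation: the function names, the toy feature vectors, and the omission of learned query/key/value projections and multiple heads are all simplifying assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross attention (hypothetical helper).

    queries: feature vectors from one modality, one per time step.
    keys, values: feature vectors from the other modality.
    Learned projections are omitted here (an assumption of this sketch).
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    value_dim = len(values[0])
    attended = []
    for q in queries:
        # Similarity of this query step to every key step.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(value_dim)]
        )
    return attended


# Toy features (made up): 3 audio steps and 2 video steps, dimension 2.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video = [[1.0, 0.0], [0.0, 1.0]]

audio_attends_video = cross_attention(audio, video, video)
video_attends_audio = cross_attention(video, audio, audio)
```

The output has one vector per query step, so `audio_attends_video` keeps the audio sequence length while mixing in video content, and `video_attends_audio` does the reverse. In practice a library layer such as PyTorch's `torch.nn.MultiheadAttention` would typically be used instead of hand-written loops.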

Code & Models

Repositories

shravan-18/AVTCA (PyTorch, official)

Taxonomy

Topics: Wireless Sensor Networks and IoT · Speech and Audio Processing · Internet of Things and Social Network Interactions

Methods: Attention Is All You Need · Adam · Label Smoothing · Linear Layer · Byte Pair Encoding · Layer Normalization · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dense Connections