Sync-TVA: A Graph-Attention Framework for Multimodal Emotion Recognition with Cross-Modal Fusion

Zeyu Deng, Yanhui Lu, Jiashu Liao, Shuang Wu, and Chongfeng Wei

arXiv:2507.21395 · cs.MM · July 30, 2025

TL;DR

Sync-TVA introduces a graph-attention framework that strengthens cross-modal interaction and fusion for multimodal emotion recognition, improving accuracy and robustness, especially on class-imbalanced data.

Contribution

The paper proposes a novel end-to-end graph-attention model that combines modality-specific dynamic enhancement with structured cross-modal fusion to improve emotion recognition.
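The model is built on graph attention, but this page gives no equations for its layers. As a reference point only, here is a minimal pure-Python sketch of a standard single-head graph attention (GAT) layer in the style of Veličković et al. (2018), not the authors' layer. The one-node-per-modality utterance graph, the feature values, and the names `gat_layer`, `matvec`, and `neighbours` are illustrative assumptions; the output nonlinearity and multi-head attention are omitted.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(features, neighbours, W, a):
    """Single-head GAT layer (output nonlinearity omitted for brevity).

    features:   dict node -> input feature vector
    neighbours: dict node -> list of adjacent nodes (a self-loop is added)
    W:          shared projection matrix (out_dim x in_dim)
    a:          attention vector of length 2 * out_dim
    Returns (updated features, attention weights per node).
    """
    z = {n: matvec(W, h) for n, h in features.items()}
    out, attn = {}, {}
    for i in features:
        hood = [i] + [j for j in neighbours.get(i, []) if j != i]
        # e_ij = LeakyReLU(a^T [z_i || z_j]); list + list concatenates
        scores = [leaky_relu(sum(ak * v for ak, v in zip(a, z[i] + z[j])))
                  for j in hood]
        # Softmax over the neighbourhood, shifted by the max for stability
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        attn[i] = dict(zip(hood, alpha))
        # h_i' = sum_j alpha_ij * z_j
        out[i] = [sum(alpha[k] * z[j][d] for k, j in enumerate(hood))
                  for d in range(len(z[i]))]
    return out, attn


# Hypothetical utterance graph: one node per modality, fully connected.
features = {"text": [1.0, 0.0], "audio": [0.0, 1.0], "visual": [1.0, 1.0]}
neighbours = {n: [m for m in features if m != n] for n in features}
identity = [[1.0, 0.0], [0.0, 1.0]]

# With a zero attention vector every score is equal, so each node's output
# is the plain average of the three modality vectors.
out, attn = gat_layer(features, neighbours, identity, [0.0, 0.0, 0.0, 0.0])
print(out["text"])
```

With a learned, non-zero `a`, the weights stop being uniform and each modality node draws more heavily on whichever neighbours score highest, which is the mechanism a heterogeneous cross-modal graph would exploit.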

Findings

1. Outperforms state-of-the-art models on the MELD and IEMOCAP datasets.
2. Achieves higher accuracy and weighted F1 scores, especially under class imbalance.
3. Demonstrates robustness across different multimodal emotion recognition scenarios.
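Weighted F1 is the usual companion to accuracy on MELD and IEMOCAP because it averages per-class F1 scores weighted by each class's share of the true labels, so a dominant class cannot hide poor performance on rare ones as easily as with plain accuracy. The sketch below computes it in pure Python; it should agree with scikit-learn's `f1_score(..., average="weighted")`. The labels are made-up toy data, not results from the paper, and `weighted_f1` and `accuracy` are illustrative helper names.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 over classes present in y_true."""
    n = len(y_true)
    support = Counter(y_true)
    score = 0.0
    for c, s in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = s - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (s / n) * f1  # weight each class by its share of samples
    return score


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels with a dominant "neutral" class (not data from the paper).
y_true = ["neutral", "neutral", "neutral", "joy", "anger"]
y_pred = ["neutral", "neutral", "joy", "joy", "neutral"]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
# accuracy = 0.6, weighted F1 ≈ 0.533 (8/15)
```

Here the missed "anger" utterance contributes an F1 of zero, which pulls the weighted F1 below accuracy.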

Abstract

Multimodal emotion recognition (MER) is crucial for enabling emotionally intelligent systems that perceive and respond to human emotions. However, existing methods suffer from limited cross-modal interaction and imbalanced contributions across modalities. To address these issues, we propose Sync-TVA, an end-to-end graph-attention framework featuring modality-specific dynamic enhancement and structured cross-modal fusion. Our design incorporates a dynamic enhancement module for each modality and constructs heterogeneous cross-modal graphs to model semantic relations across text, audio, and visual features. A cross-attention fusion mechanism further aligns multimodal cues for robust emotion inference. Experiments on MELD and IEMOCAP demonstrate consistent improvements over state-of-the-art models in both accuracy and weighted F1 score, especially under class-imbalanced conditions.
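The abstract names a cross-attention fusion step without specifying it. A common way to realise cross-modal attention is scaled dot-product attention in which one modality supplies the query and another supplies the keys and values. The sketch below assumes text as the query side and omits learned query/key/value projections and multiple heads. The names `cross_attention` and `fuse`, the concatenation order, and the feature values are hypothetical choices for illustration, not the paper's method.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Scaled dot-product attention of one query over another modality's sequence."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the value vectors
    context = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
    return context, weights


def fuse(text_vec, audio_seq, visual_seq):
    """Hypothetical fusion: the text vector queries the audio and visual frames,
    and both context vectors are concatenated after the text vector."""
    audio_ctx, _ = cross_attention(text_vec, audio_seq, audio_seq)
    visual_ctx, _ = cross_attention(text_vec, visual_seq, visual_seq)
    return text_vec + audio_ctx + visual_ctx


# Made-up 2-d features for one utterance (two audio frames, one visual frame).
text = [1.0, 0.0]
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[0.5, 0.5]]
fused = fuse(text, audio, visual)  # length 6: text | audio context | visual context
```

Concatenation is only one option; a learned gate or a weighted sum would also work, and this page does not say which Sync-TVA uses.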
