TAGF: Time-aware Gated Fusion for Multimodal Valence-Arousal Estimation

Yubeen Lee; Sangeun Lee; Chaewon Park; Junyeop Cha; Eunil Park

arXiv:2507.02080·cs.MM·July 4, 2025

TL;DR

This paper introduces TAGF, a time-aware gated fusion framework for multimodal valence-arousal estimation. It uses a BiLSTM-based gate to weight the outputs of recursive audio-visual attention steps, aiming to make estimation more robust to noise and misalignment between the two modalities.

Contribution

The paper proposes a BiLSTM-based temporal gating mechanism that learns the relative importance of each recursive fusion step, so that the fused representation reflects both how emotional expressions evolve over time and how the audio and visual modalities interact.

Findings

01

Achieves competitive performance on the Aff-Wild2 dataset

02

Demonstrates robustness to cross-modal misalignment

03

Models dynamic emotional transitions effectively

Abstract

Multimodal emotion recognition often suffers from performance degradation in valence-arousal estimation due to noise and misalignment between audio and visual modalities. To address this challenge, we introduce TAGF, a Time-aware Gated Fusion framework for multimodal emotion recognition. TAGF adaptively modulates the contribution of recursive attention outputs based on temporal dynamics. Specifically, it incorporates a BiLSTM-based temporal gating mechanism that learns the relative importance of each recursive step and integrates the multistep cross-modal features accordingly. By embedding temporal awareness into the recursive fusion process, TAGF captures the sequential evolution of emotional expressions and the complex interplay between modalities. Experimental results on the Aff-Wild2 dataset demonstrate that TAGF achieves competitive performance compared with…
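
The gating idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the gate's exact form, so the code below assumes that a bidirectional LSTM reads the sequence of recursive attention outputs, a linear layer turns each BiLSTM state into a scalar score, and a softmax over steps yields weights for a weighted sum. All names (`LSTMCell`, `gated_fusion`, `score_w`), the omission of bias terms, the random untrained weights, and the toy dimensions are assumptions made for illustration only.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class LSTMCell:
    """Minimal LSTM cell (biases omitted for brevity; weights are untrained)."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        # One weight matrix per gate (input, forget, output, candidate),
        # each acting on the concatenation [x; h].
        self.W = {g: rand_matrix(hid_dim, in_dim + hid_dim, rng) for g in "ifog"}

    def step(self, x, h, c):
        z = x + h  # list concatenation: [x; h]
        i = [sigmoid(a) for a in matvec(self.W["i"], z)]
        f = [sigmoid(a) for a in matvec(self.W["f"], z)]
        o = [sigmoid(a) for a in matvec(self.W["o"], z)]
        g = [math.tanh(a) for a in matvec(self.W["g"], z)]
        c_new = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
        h_new = [oi * math.tanh(ci) for oi, ci in zip(o, c_new)]
        return h_new, c_new

    def run(self, seq):
        h = [0.0] * self.hid_dim
        c = [0.0] * self.hid_dim
        outputs = []
        for x in seq:
            h, c = self.step(x, h, c)
            outputs.append(h)
        return outputs


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(step_feats, fwd, bwd, score_w):
    """Weight each recursive step's features by a BiLSTM-derived gate (assumed form)."""
    h_fwd = fwd.run(step_feats)
    h_bwd = bwd.run(step_feats[::-1])[::-1]  # backward pass, re-aligned to step order
    # Scalar score per step from the concatenated forward/backward states.
    scores = [
        sum(w * v for w, v in zip(score_w, hf + hb)) for hf, hb in zip(h_fwd, h_bwd)
    ]
    gates = softmax(scores)
    dim = len(step_feats[0])
    fused = [sum(g * feat[d] for g, feat in zip(gates, step_feats)) for d in range(dim)]
    return gates, fused


# Toy example: 3 recursive attention steps, each producing a 4-dim feature vector.
rng = random.Random(0)
feat_dim, hid_dim, num_steps = 4, 3, 3
step_feats = [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(num_steps)]
fwd = LSTMCell(feat_dim, hid_dim, rng)
bwd = LSTMCell(feat_dim, hid_dim, rng)
score_w = [rng.uniform(-0.5, 0.5) for _ in range(2 * hid_dim)]

gates, fused = gated_fusion(step_feats, fwd, bwd, score_w)
print("gates:", gates)
print("fused:", fused)
```

Under this assumed softmax form the gates are positive and sum to one, so the fused vector is a convex combination of the per-step features. A trained model would learn the LSTM and scoring weights jointly with the rest of the network; a sigmoid gate applied independently per step would be an equally plausible reading of the abstract.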
