Relational graph-driven differential denoising and diffusion attention fusion for multimodal conversation emotion recognition
Ying Liu, Yuntao Shou, Wei Ai, Tao Meng, and Keqin Li

TL;DR
This paper introduces a novel multimodal emotion recognition model that explicitly denoises audio and video signals, models intra- and inter-modal relations, and employs a diffusion mechanism guided by text for improved fusion.
Contribution
It proposes a relation-aware denoising and diffusion attention fusion framework that explicitly handles noisy modalities and models complex cross-modal dependencies.
Findings
Effective noise suppression in audio and video modalities.
Enhanced modeling of intra- and inter-modal emotional dependencies.
Improved multimodal emotion recognition accuracy.
Abstract
In real-world scenarios, audio and video signals are often affected by environmental noise and constrained acquisition conditions, so the extracted features contain excessive noise. Furthermore, modalities differ in data quality and information-carrying capacity. Together, these two issues cause information distortion and weight bias during the fusion phase, impairing overall recognition performance. Most existing methods neglect the impact of noisy modalities and rely on implicit weighting to model modality importance, thereby failing to explicitly account for the predominant contribution of the textual modality in emotion understanding. To address these issues, we propose a relation-aware denoising and diffusion attention fusion model for multimodal conversation emotion recognition (MCER). Specifically, we first design a differential Transformer that explicitly computes the differences…
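The abstract is cut off before it says which differences the differential Transformer computes, so the paper's exact formulation is not shown here. As a rough illustration only, the sketch below follows the general differential attention idea from the Differential Transformer literature: two softmax attention maps are computed from separate query/key projections, and one is subtracted from the other (scaled by a factor) so that attention mass common to both maps, which tends to be noise, is cancelled. Every name in it (`differential_attention`, `attention_map`, `lam`, the weight matrices, the toy dimensions) is hypothetical and chosen for illustration; none of it is taken from the paper.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_map(x, wq, wk):
    """Scaled dot-product attention weights softmax(QK^T / sqrt(d))."""
    q, k = matmul(x, wq), matmul(x, wk)
    scale = math.sqrt(len(wq[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    return [softmax([s / scale for s in row]) for row in scores]


def differential_attention(x, wq1, wk1, wq2, wk2, wv, lam=0.5):
    """Assumed form: (A1 - lam * A2) V, with A1 and A2 from separate projections.

    `lam` is a fixed scalar here purely for illustration; it is an assumption,
    not a value or mechanism stated in the abstract.
    """
    a1 = attention_map(x, wq1, wk1)
    a2 = attention_map(x, wq2, wk2)
    diff = [[p - lam * q for p, q in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    return matmul(diff, matmul(x, wv)), diff


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy input: 4 utterance-level feature vectors (e.g. noisy audio features) of size 6.
rng = random.Random(0)
T, d_in, d_head = 4, 6, 3
x = random_matrix(T, d_in, rng)
weights = [random_matrix(d_in, d_head, rng) for _ in range(5)]
out, diff = differential_attention(x, *weights, lam=0.5)
```

Because each softmax row sums to 1, every row of the differential map sums to `1 - lam`, and individual entries can be negative. That is the sense in which this form of attention subtracts out shared attention patterns rather than only re-weighting them.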
