Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion Recognition?
Vandana Rajan, Alessio Brutti, Andrea Cavallaro

TL;DR
This study compares cross-attention and self-attention mechanisms in multi-modal emotion recognition models. Both models improve on state-of-the-art methods, but their performance is generally statistically comparable to each other.
Contribution
The paper provides a systematic comparison of cross-attention versus self-attention mechanisms in multi-modal emotion recognition models.
Findings
Both the cross-attention model and the self-attention model outperform state-of-the-art methods in accuracy.
The performance of the cross-attention and self-attention models is statistically comparable.
Models that fuse multiple modalities classify emotions more accurately than their uni-modal counterparts.
Abstract
Humans express their emotions via facial expressions, voice intonation and word choices. To infer the nature of the underlying emotion, recognition models may use a single modality, such as vision, audio, and text, or a combination of modalities. Generally, models that fuse complementary information from multiple modalities outperform their uni-modal counterparts. However, a successful model that fuses modalities requires components that can effectively aggregate task-relevant information from each modality. As cross-modal attention is seen as an effective mechanism for multi-modal fusion, in this paper we quantify the gain that such a mechanism brings compared to the corresponding self-attention mechanism. To this end, we implement and compare a cross-attention and a self-attention model. In addition to attention, each model uses convolutional layers for local feature extraction and…
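To make the distinction in the abstract concrete, the following is a minimal sketch of scaled dot-product attention used in both ways. It is not the authors' implementation: it omits the learned query/key/value projections, multiple heads, and the convolutional layers the abstract mentions, and the function names and toy feature sequences are hypothetical. The only difference between the two modes is where the keys and values come from. In self-attention they come from the same modality as the queries. In cross-attention they come from the other modality.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain lists of feature vectors.

    Assumption: no learned projections, so queries and keys must share
    the same feature dimension.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def self_attention(x):
    """Queries, keys, and values all come from the same modality."""
    return attention(x, x, x)


def cross_attention(x, y):
    """Queries come from modality x; keys and values come from modality y."""
    return attention(x, y, y)


# Hypothetical toy inputs: 3 audio frames and 2 text tokens, 2 features each.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.5, 0.5], [1.0, -1.0]]

audio_self = self_attention(audio)              # audio attends to audio
audio_from_text = cross_attention(audio, text)  # audio attends to text
```

In both cases the output has one vector per query, so `audio_self` and `audio_from_text` each contain three vectors. What changes is which sequence those vectors are averaged from.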
Taxonomy
Topics: Emotion and Mood Recognition
