Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion Recognition?

Vandana Rajan; Alessio Brutti; Andrea Cavallaro

arXiv:2202.09263 · cs.LG · February 21, 2022

TL;DR

This study compares cross-attention and self-attention mechanisms in multi-modal emotion recognition models. It finds that both improve on state-of-the-art methods, but that their performance is generally statistically comparable to each other.

Contribution

The paper provides a systematic comparison of cross-attention versus self-attention mechanisms in multi-modal emotion recognition models.

Findings

01 Both the cross-attention and the self-attention model outperform the state of the art in accuracy.

02 The performance of the cross-attention and self-attention models is statistically comparable.

03 Models that fuse multiple modalities classify emotions more accurately than their uni-modal counterparts.

Abstract

Humans express their emotions via facial expressions, voice intonation and word choices. To infer the nature of the underlying emotion, recognition models may use a single modality, such as vision, audio, or text, or a combination of modalities. Generally, models that fuse complementary information from multiple modalities outperform their uni-modal counterparts. However, a successful model that fuses modalities requires components that can effectively aggregate task-relevant information from each modality. As cross-modal attention is seen as an effective mechanism for multi-modal fusion, in this paper we quantify the gain that such a mechanism brings compared to the corresponding self-attention mechanism. To this end, we implement and compare a cross-attention and a self-attention model. In addition to attention, each model uses convolutional layers for local feature extraction and…
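To make the distinction in the abstract concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. It is not the authors' architecture: the feature sizes, the random projection matrices, and the names `attention`, `audio`, and `text` are all assumptions chosen for illustration. The only point it shows is that self-attention draws queries, keys, and values from the same sequence, while cross-attention draws queries from one modality and keys and values from another.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query_seq, context_seq, w_q, w_k, w_v):
    """Single-head scaled dot-product attention (illustrative helper).

    query_seq:   (T_q, d_model) sequence that produces the queries
    context_seq: (T_c, d_model) sequence that produces keys and values
    """
    q = query_seq @ w_q
    k = context_seq @ w_k
    v = context_seq @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T_q, T_c)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model, d_head = 16, 8                        # assumed sizes
audio = rng.normal(size=(50, d_model))         # hypothetical 50 audio frames
text = rng.normal(size=(12, d_model))          # hypothetical 12 text tokens
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

# Self-attention: audio attends to itself.
self_out, self_w = attention(audio, audio, w_q, w_k, w_v)

# Cross-attention: audio queries attend to text keys and values.
cross_out, cross_w = attention(audio, text, w_q, w_k, w_v)
```

In both cases the output has one row per query position; only the attention map changes shape, from frames-by-frames in the self-attention case to frames-by-tokens in the cross-attention case.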

Code & Models

Repositories

smartcameras/selfcrossattn (PyTorch, official)

Taxonomy

Topics: Emotion and Mood Recognition