Multimodal Speech Emotion Recognition using Cross Attention with Aligned Audio and Text

Yoonhyung Lee; Seunghyun Yoon; Kyomin Jung

arXiv:2207.12895 · eess.AS · July 27, 2022


TL;DR

This paper introduces a Cross Attention Network that aligns and integrates audio and text signals for improved speech emotion recognition, outperforming existing methods on the IEMOCAP dataset.

Contribution

The novel model employs aligned segmentation and cross attention mechanisms to effectively fuse audio and text modalities for emotion recognition.

Findings

01

Outperforms state-of-the-art systems on IEMOCAP by a relative 2.66% in weighted accuracy and 3.18% in unweighted accuracy.

02

Uses aligned segmentation of audio and text signals for better multimodal fusion.

03

Aggregates each modality independently with global attention, then applies each modality's attention weights to the other modality so that audio and text inform each other.
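
The aligned segmentation in finding 02 can be illustrated with a minimal sketch. It assumes word-level time spans are already available (for example from a forced aligner) and that audio is represented as frame-level features at a fixed hop; the function name `align_audio_to_words`, the mean pooling, and the input layout are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def align_audio_to_words(frames, hop_s, word_spans):
    """Pool frame-level audio features into one step per word.

    frames:     (T, D) array of frame features (hypothetical input layout).
    hop_s:      seconds between consecutive frames.
    word_spans: list of (start_s, end_s) per word, assumed to come
                from a forced aligner.
    Returns an (N, D) array, so audio step i and word i cover the same
    time span.
    """
    num_frames = frames.shape[0]
    steps = []
    for start_s, end_s in word_spans:
        lo = int(np.floor(start_s / hop_s))
        hi = int(np.ceil(end_s / hop_s))
        # Clamp to valid indices and keep at least one frame per word.
        lo = min(max(lo, 0), num_frames - 1)
        hi = min(max(hi, lo + 1), num_frames)
        # Mean pooling is an assumption made for this sketch.
        steps.append(frames[lo:hi].mean(axis=0))
    return np.stack(steps)
```

With N words, the result has N audio steps, matching the N text steps one to one.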

Abstract

In this paper, we propose a novel speech emotion recognition model called Cross Attention Network (CAN) that uses aligned audio and text signals as inputs. It is inspired by the fact that humans recognize speech as a combination of simultaneously produced acoustic and textual signals. First, our method segments the audio and the underlying text signals into an equal number of steps in an aligned way, so that the same time steps of the sequential signals cover the same time span in the signals. Together with this technique, we apply cross attention to aggregate the sequential information from the aligned signals. In the cross attention, each modality is aggregated independently by applying the global attention mechanism to each modality. Then, the attention weights of each modality are applied directly to the other modality in a crossed way, so that the CAN gathers the audio and text…
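
The cross attention step described in the abstract can be sketched as follows. This is one possible reading, not the authors' implementation: the dot-product scoring against query vectors (`q_audio`, `q_text`), the function names, and the final concatenation are assumptions made for illustration. The inputs are assumed to be already aligned, with the same number of steps N in both modalities.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_attention(h_audio, h_text, q_audio, q_text):
    """Aggregate two aligned sequences with own and crossed weights.

    h_audio, h_text: (N, D) hidden states over the same N aligned steps.
    q_audio, q_text: (D,) query vectors (hypothetical learnable parameters).
    Returns a (4 * D,) vector.
    """
    # Global attention: one weight per time step, per modality.
    w_audio = softmax(h_audio @ q_audio)
    w_text = softmax(h_text @ q_text)

    # Each modality aggregated with its own weights.
    audio_own = w_audio @ h_audio
    text_own = w_text @ h_text

    # Crossed: each modality aggregated with the other's weights.
    # This only makes sense because step i covers the same time span
    # in both sequences.
    audio_cross = w_text @ h_audio
    text_cross = w_audio @ h_text

    return np.concatenate([audio_own, text_own, audio_cross, text_cross])
```

The crossed terms are what depend on the alignment: a weight computed for text step i is reused on audio step i, which is meaningful only when both steps describe the same stretch of speech.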
