Multi-Scale Temporal Transformer For Speech Emotion Recognition

Zhipeng Li; Xiaofen Xing; Yuanbo Fang; Weibin Zhang; Hengsheng Fan; Xiangmin Xu

arXiv:2410.00390 · eess.AS · October 2, 2024 · Interspeech

Open Access

TL;DR

This paper introduces a Multi-Scale Transformer model that enhances speech emotion recognition by capturing multi-scale local features, outperforming existing methods while reducing computational costs.

Contribution

The paper proposes a novel Multi-Scale Transformer with three components to improve local emotion feature learning and efficiency in speech emotion recognition.

Findings

01. Significantly outperforms a vanilla Transformer and state-of-the-art methods

02. Effective in capturing multi-scale local emotion representations

03. Reduces computational cost compared to existing models

Abstract

Speech emotion recognition plays a crucial role in human-machine interaction systems. Recently, various optimized Transformers have been successfully applied to speech emotion recognition. However, existing Transformer architectures focus more on global information and require substantial computation. On the other hand, abundant emotional representations exist locally in different parts of the input speech. To tackle these problems, we propose a Multi-Scale TRansformer (MSTR) for speech emotion recognition. It comprises three main components: (1) a multi-scale temporal feature operator, (2) a fractal self-attention module, and (3) a scale mixer module. These three components effectively enhance the Transformer's ability to learn multi-scale local emotion representations. Experimental results demonstrate that the proposed MSTR model significantly outperforms a vanilla…
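
The abstract does not spell out how the three components are implemented, so the following is only a minimal sketch of the general idea of "multi-scale local" processing, not the authors' method. It assumes that a multi-scale view can be approximated by average-pooling frame-level features over non-overlapping windows of several sizes, and that mixing scales can be approximated by repeating each pooled frame back to the original frame rate and averaging. The function names (`avg_pool`, `multi_scale_views`, `mix_scales`) and the window sizes are hypothetical, and the fractal self-attention step is deliberately left out because the abstract gives no detail on it.

```python
from typing import Dict, List, Sequence

Frame = List[float]  # one feature vector per speech frame


def avg_pool(frames: Sequence[Frame], window: int) -> List[Frame]:
    """Average non-overlapping windows of `window` frames.

    The last window may be shorter if len(frames) is not a multiple of `window`.
    """
    pooled: List[Frame] = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


def multi_scale_views(frames: Sequence[Frame],
                      windows: Sequence[int] = (1, 2, 4)) -> Dict[int, List[Frame]]:
    """Build one pooled sequence per temporal scale (hypothetical window sizes)."""
    return {w: avg_pool(frames, w) for w in windows}


def mix_scales(views: Dict[int, List[Frame]], num_frames: int) -> List[Frame]:
    """Repeat each scale back to frame rate and average across scales.

    This is an assumed stand-in for a scale mixer, not the paper's module.
    """
    dim = len(next(iter(views.values()))[0])
    mixed = [[0.0] * dim for _ in range(num_frames)]
    for window, pooled in views.items():
        for t in range(num_frames):
            src = pooled[t // window]  # pooled frame that covers time step t
            for d in range(dim):
                mixed[t][d] += src[d] / len(views)
    return mixed


if __name__ == "__main__":
    # Four frames with a single feature each, purely for illustration.
    frames = [[0.0], [2.0], [4.0], [6.0]]
    views = multi_scale_views(frames)
    print(views[2])                      # coarser view: pairs of frames averaged
    print(mix_scales(views, len(frames)))
```

In this sketch the coarser views are shorter than the input, which hints at why operating on pooled sequences can reduce computation relative to attending over every frame; the actual savings reported in the paper depend on details not given in the abstract.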

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Dense Connections · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Linear Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings