Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition

Yong Wang; Cheng Lu; Hailun Lian; Yan Zhao; Björn Schuller; Yuan Zong; Wenming Zheng

arXiv:2401.10536·cs.CL·January 22, 2024·1 cites

TL;DR

This paper introduces Speech Swin-Transformer, a hierarchical Transformer model with shifted windows designed to capture multi-scale emotional features in speech signals, significantly improving speech emotion recognition performance.

Contribution

The paper proposes a novel hierarchical speech Transformer with shifted windows that effectively aggregates multi-scale emotional features for speech emotion recognition.

Findings

01 Outperforms state-of-the-art SER methods

02 Effective multi-scale emotion feature aggregation

03 Demonstrates superior hierarchical speech representation

Abstract

Swin-Transformer has demonstrated remarkable success in computer vision by leveraging its hierarchical feature representation based on the Transformer. In speech signals, emotional information is distributed across different scales of speech features, e.g., word, phrase, and utterance. Drawing on this inspiration, this paper presents a hierarchical speech Transformer with shifted windows, called Speech Swin-Transformer, to aggregate multi-scale emotion features for speech emotion recognition (SER). Specifically, we first divide the speech spectrogram into segment-level patches in the time domain, each composed of multiple frame patches. These segment-level patches are then encoded using a stack of Swin blocks, in which a local window Transformer is used to explore local inter-frame emotional information across the frame patches of each segment patch. After that, we also design a shifted window…
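
The patching and windowing steps in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `frames_per_segment` and `shift` parameters, the choice to drop trailing frames, and the use of single-head attention without learned projections are all assumptions made for illustration. The shifted variant is approximated by a cyclic roll along the time axis, as in the original vision Swin-Transformer, and the attention mask that normally accompanies that roll is omitted for brevity.

```python
import numpy as np

def segment_patches(spec, frames_per_segment):
    """Split a (time, freq) spectrogram into non-overlapping time segments.

    Trailing frames that do not fill a whole segment are dropped
    (an assumption made for this sketch).
    """
    t, f = spec.shape
    n_seg = t // frames_per_segment
    return spec[: n_seg * frames_per_segment].reshape(n_seg, frames_per_segment, f)

def window_self_attention(window):
    """Single-head scaled dot-product self-attention over the frames of one
    window. Learned query/key/value projections are omitted (assumption)."""
    d = window.shape[-1]
    scores = window @ window.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ window

def local_window_attention(spec, frames_per_segment, shift=0):
    """Apply attention independently inside each time window.

    With shift > 0 the frames are rolled along time before windowing, so
    each window straddles the boundary between two unshifted windows.
    The roll is undone afterwards to restore the original frame order.
    """
    t_used = (spec.shape[0] // frames_per_segment) * frames_per_segment
    x = np.roll(spec[:t_used], -shift, axis=0)
    windows = x.reshape(-1, frames_per_segment, x.shape[1])
    out = np.stack([window_self_attention(w) for w in windows])
    return np.roll(out.reshape(t_used, -1), shift, axis=0)

# Hypothetical input: 22 frames of an 8-bin spectrogram.
spec = np.random.default_rng(0).standard_normal((22, 8))
local = local_window_attention(spec, frames_per_segment=4)
shifted = local_window_attention(spec, frames_per_segment=4, shift=2)
print(segment_patches(spec, 4).shape, local.shape, shifted.shape)
```

Alternating the unshifted and shifted calls is what lets information flow between neighbouring segments while each attention computation stays local to one window.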

Taxonomy

Topics: Speech and Audio Processing · Emotion and Mood Recognition · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Absolute Position Encodings · Label Smoothing · Layer Normalization · Adam · Residual Connection · Dropout · Linear Layer · Multi-Head Attention · Byte Pair Encoding