Time-Frequency Transformer: A Novel Time Frequency Joint Learning Method for Speech Emotion Recognition

Yong Wang; Cheng Lu; Yuan Zong; Hailun Lian; Yan Zhao; Sunan Li

arXiv:2308.14568·cs.SD·August 29, 2023

TL;DR

This paper introduces a novel Time-Frequency Transformer model that jointly learns time and frequency domain features for speech emotion recognition, capturing both local and global emotional patterns.

Contribution

It proposes a new joint learning framework using Transformer models to effectively model local and global emotional features in speech signals.

Findings

01

Outperforms state-of-the-art methods on IEMOCAP and CASIA datasets.

02

Effectively captures global emotion patterns in time-frequency domain.

03

Models local emotional correlations in time and frequency domains.

Abstract

In this paper, we propose a novel time-frequency joint learning method for speech emotion recognition, called Time-Frequency Transformer. Its advantage is that it can extract global emotion patterns in the time-frequency domain of speech signals while modeling the local emotional correlations in the time domain and the frequency domain separately. To this end, we first design a Time Transformer and a Frequency Transformer to capture the local emotion patterns between frames and within frequency bands, respectively, so that emotion information is modeled fully in both the time and frequency domains. Then, a Time-Frequency Transformer is proposed to mine the time-frequency emotional correlations from the local time-domain and frequency-domain emotion features, in order to learn a more discriminative global speech emotion representation. The whole…
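
The two-stage idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: it assumes a single attention head, uses random projection matrices in place of learned weights, attends over all frames and all frequency bins rather than over local chunks, and fuses the two branches by simple token concatenation. The names `self_attention`, `time_frequency_sketch`, and `d_k` are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, rng, d_k=16):
    """Single-head scaled dot-product self-attention over the rows of `tokens`.

    tokens: array of shape (n_tokens, d_in).
    Random projections stand in for learned parameters (an assumption).
    Returns (features of shape (n_tokens, d_k), attention of shape (n_tokens, n_tokens)).
    """
    d_in = tokens.shape[1]
    w_q, w_k, w_v = (
        rng.standard_normal((d_in, d_k)) / np.sqrt(d_in) for _ in range(3)
    )
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(d_k), axis=-1)
    return attn @ v, attn


def time_frequency_sketch(spec, seed=0, d_k=16):
    """spec: spectrogram-like array of shape (T frames, F frequency bins)."""
    rng = np.random.default_rng(seed)
    # Time branch: each frame is a token described by its F frequency values.
    time_feat, _ = self_attention(spec, rng, d_k)        # (T, d_k)
    # Frequency branch: each frequency bin is a token described by its T frames.
    freq_feat, _ = self_attention(spec.T, rng, d_k)      # (F, d_k)
    # Joint stage: attend across both sets of tokens together.
    joint_tokens = np.concatenate([time_feat, freq_feat], axis=0)  # (T + F, d_k)
    joint_feat, joint_attn = self_attention(joint_tokens, rng, d_k)
    # Mean-pool the joint tokens into one utterance-level vector.
    return joint_feat.mean(axis=0), joint_attn


if __name__ == "__main__":
    # Synthetic input standing in for a log-Mel spectrogram (50 frames, 40 bins).
    spec = np.random.default_rng(1).standard_normal((50, 40))
    embedding, joint_attn = time_frequency_sketch(spec)
    print(embedding.shape, joint_attn.shape)
```

In a trained model the pooled vector would feed an emotion classifier; here it only shows how time-wise and frequency-wise attention outputs could be combined before a joint attention stage.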

Taxonomy

Topics: Speech and Audio Processing · Emotion and Mood Recognition · Speech Recognition and Synthesis