Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture

Ohad Cohen; Gershon Hazan; Sharon Gannot

arXiv:2406.03272·eess.AS·September 17, 2024

Open Access

TL;DR

This paper introduces a multi-microphone speech emotion recognition system built on the Hierarchical Token-semantic Audio Transformer (HTS-AT), which outperforms single-channel baselines in reverberant, real-world environments.

Contribution

It adapts the HTS-AT transformer architecture to multi-microphone input in order to improve emotion recognition under reverberant acoustic conditions.

Findings

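01 Multi-microphone input improves emotion classification accuracy.

02 The multi-microphone transformer model outperforms single-channel baselines in real-world reverberant environments.

03 Two strategies for using multi-channel data are evaluated: averaging mel-spectrograms across channels and summing patch-embedded representations.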

Abstract

The performance of most emotion recognition systems degrades in real-life situations ('in the wild' scenarios) where the audio is contaminated by reverberation. Our study explores new methods to alleviate the performance degradation of speech emotion recognition (SER) algorithms and to develop a more robust system for adverse conditions. We propose processing multi-microphone signals to address these challenges and improve emotion classification accuracy. We adapt a state-of-the-art transformer model, the Hierarchical Token-semantic Audio Transformer (HTS-AT), to handle multi-channel audio inputs. We evaluate two strategies: averaging mel-spectrograms across channels and summing patch-embedded representations. Our multi-microphone model achieves superior performance compared to single-channel baselines when tested in real-world reverberant environments.
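The two channel-fusion strategies named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the non-overlapping square patches, the bias-free linear patch projection, and the use of one projection matrix per microphone are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def average_mel_channels(mels):
    """Strategy 1 (sketch): average mel-spectrograms over the channel axis.

    mels: array of shape (channels, n_mels, frames).
    Returns a single (n_mels, frames) spectrogram.
    """
    return mels.mean(axis=0)

def patch_embed(mel, proj, patch):
    """Hypothetical patch embedding: cut the spectrogram into non-overlapping
    patch x patch tiles, flatten each tile, and apply a linear projection.

    mel:  (n_mels, frames)
    proj: (patch * patch, embed_dim), assumed bias-free
    Returns (num_patches, embed_dim).
    """
    n_mels, frames = mel.shape
    rows, cols = n_mels // patch, frames // patch
    tiles = mel[: rows * patch, : cols * patch].reshape(rows, patch, cols, patch)
    tiles = tiles.transpose(0, 2, 1, 3).reshape(rows * cols, patch * patch)
    return tiles @ proj

def sum_patch_embeddings(mels, projs, patch):
    """Strategy 2 (sketch): embed each channel separately, then sum the tokens.

    projs holds one projection matrix per channel (an assumption).
    """
    return sum(patch_embed(m, p, patch) for m, p in zip(mels, projs))

# Toy input: 3 microphones, 64 mel bins, 32 frames of random data.
rng = np.random.default_rng(0)
mels = rng.standard_normal((3, 64, 32))
projs = [rng.standard_normal((16, 8)) for _ in range(3)]

avg_mel = average_mel_channels(mels)            # shape (64, 32)
tokens = sum_patch_embeddings(mels, projs, 4)   # shape (128, 8)
```

One consequence of these assumptions: if every channel shared the same bias-free projection, summing the patch embeddings would equal the embedding of the averaged spectrogram scaled by the number of channels. The two strategies differ in this sketch only because each channel has its own projection.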

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing

Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention