Recovering Performance in Speech Emotion Recognition from Discrete Tokens via Multi-Layer Fusion and Paralinguistic Feature Integration
Esther Sun, Abinay Reddy Naini, Carlos Busso

TL;DR
This paper investigates speech emotion recognition (SER) from discrete speech tokens. It uses multi-layer fusion and paralinguistic feature integration to recover information lost during quantization and to close the performance gap with continuous representations.
Contribution
It introduces an approach that combines attention-based multi-layer fusion with integration of openSMILE paralinguistic features to improve discrete token-based SER performance.
Findings
Multi-layer fusion improves information recovery from discrete tokens.
Integrating openSMILE features reintroduces crucial paralinguistic cues.
The proposed methods close the performance gap with continuous representations.
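The two recovery strategies in the findings above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`fuse_layers`, `add_paralinguistic`), the random codebooks, the single learned query vector, the mean pooling over time, and the 88-dimensional acoustic vector are all assumptions chosen for illustration. In practice the acoustic vector would come from openSMILE and the embeddings and query would be trained.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_layers(token_ids, codebooks, query):
    """Attention-weighted fusion of per-layer discrete tokens (hypothetical helper).

    token_ids: (num_layers, num_frames) integer token indices, one row per layer
    codebooks: (num_layers, vocab_size, dim) embedding table per layer
    query:     (dim,) vector that scores each layer (learned in a real model)
    """
    num_layers = token_ids.shape[0]
    # Embed each layer's tokens and mean-pool over time -> (num_layers, dim).
    pooled = np.stack(
        [codebooks[l][token_ids[l]].mean(axis=0) for l in range(num_layers)]
    )
    # Scaled dot-product score per layer, normalized to attention weights.
    scores = pooled @ query / np.sqrt(pooled.shape[1])
    weights = softmax(scores)
    # Weighted sum of the layer summaries -> (dim,).
    fused = weights @ pooled
    return fused, weights

def add_paralinguistic(fused, acoustic):
    # Concatenate the fused token summary with utterance-level acoustic features.
    return np.concatenate([fused, acoustic])

rng = np.random.default_rng(0)
num_layers, num_frames, vocab_size, dim = 4, 50, 16, 8
token_ids = rng.integers(0, vocab_size, size=(num_layers, num_frames))
codebooks = rng.normal(size=(num_layers, vocab_size, dim))
query = rng.normal(size=dim)

fused, weights = fuse_layers(token_ids, codebooks, query)
acoustic = rng.normal(size=88)  # placeholder for an openSMILE feature vector
features = add_paralinguistic(fused, acoustic)
```

The resulting `features` vector would then feed an emotion classifier. Concatenation is only one simple way to combine the two sources and is assumed here rather than taken from the paper.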
Abstract
Discrete speech tokens offer significant advantages for storage and language model integration, but their application in speech emotion recognition (SER) is limited by paralinguistic information loss during quantization. This paper presents a comprehensive investigation of discrete tokens for SER. Using a fine-tuned WavLM-Large model, we systematically quantify performance degradation across different layer configurations and k-means quantization granularities. To recover the information loss, we propose two key strategies: (1) attention-based multi-layer fusion to recapture complementary information from different layers, and (2) integration of openSMILE features to explicitly reintroduce paralinguistic cues. We also compare mainstream neural codec tokenizers (SpeechTokenizer, DAC, EnCodec) and analyze their behaviors when fused with acoustic features. Our findings demonstrate that…
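To make the quantization step concrete, the following is a minimal sketch of k-means tokenization applied to frame-level features, with a simple measure of how much information the tokens discard. It assumes random vectors in place of real WavLM layer outputs, and the helpers `kmeans` and `quantization_error` are hypothetical. The mean squared reconstruction error is used only as a rough proxy for information loss, not as the paper's evaluation metric.

```python
import numpy as np

def kmeans(frames, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns centroids and per-frame token ids."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct frames.
    centroids = frames[rng.choice(len(frames), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every frame to every centroid -> (n, k).
        dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        tokens = dists.argmin(axis=1)
        for j in range(k):
            members = frames[tokens == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    tokens = dists.argmin(axis=1)
    return centroids, tokens

def quantization_error(frames, centroids, tokens):
    # Mean squared distance between each frame and its assigned centroid.
    return float(((frames - centroids[tokens]) ** 2).sum(axis=1).mean())

rng = np.random.default_rng(1)
frames = rng.normal(size=(500, 16))  # stand-in for one layer's frame features

errors = {}
for k in (4, 32):
    centroids, tokens = kmeans(frames, k)
    errors[k] = quantization_error(frames, centroids, tokens)
```

Comparing `errors[4]` with `errors[32]` shows the trade-off behind quantization granularity: a larger codebook keeps more detail per frame at the cost of a larger token vocabulary.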
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Phonetics and Phonology Research
