Recovering Performance in Speech Emotion Recognition from Discrete Tokens via Multi-Layer Fusion and Paralinguistic Feature Integration

Esther Sun; Abinay Reddy Naini; Carlos Busso

arXiv:2601.17085 · eess.AS · January 27, 2026

TL;DR

This paper studies speech emotion recognition (SER) from discrete speech tokens. It shows that attention-based multi-layer fusion and integration of openSMILE paralinguistic features recover information lost during quantization and close the performance gap with continuous representations.

Contribution

The paper combines attention-based multi-layer fusion with paralinguistic feature integration to improve SER performance from discrete tokens.

Findings

01  Multi-layer fusion improves information recovery from discrete tokens.

02  Integrating openSMILE features reintroduces crucial paralinguistic cues.

03  The proposed methods close the performance gap with continuous representations.

Abstract

Discrete speech tokens offer significant advantages for storage and language model integration, but their application in speech emotion recognition (SER) is limited by paralinguistic information loss during quantization. This paper presents a comprehensive investigation of discrete tokens for SER. Using a fine-tuned WavLM-Large model, we systematically quantify performance degradation across different layer configurations and k-means quantization granularities. To recover the information loss, we propose two key strategies: (1) attention-based multi-layer fusion to recapture complementary information from different layers, and (2) integration of openSMILE features to explicitly reintroduce paralinguistic cues. We also compare mainstream neural codec tokenizers (SpeechTokenizer, DAC, EnCodec) and analyze their behaviors when fused with acoustic features. Our findings demonstrate that…
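The pipeline the abstract outlines (k-means quantization per layer, attention-based fusion across layers, then adding acoustic features) can be illustrated with a small sketch. This is not the authors' implementation: every array below is random stand-in data, the sizes are hypothetical, the single-query softmax attention is one simple choice of fusion, and concatenation is only an assumed way to integrate the acoustic vector.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(frames, centroids):
    """Assign each frame (T, D) to its nearest centroid (K, D); returns (T,) token ids."""
    sq_dist = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return sq_dist.argmin(axis=1)

def attention_fuse(layer_vecs, query):
    """Softmax-weighted sum over layers. layer_vecs: (L, D), query: (D,)."""
    scores = layer_vecs @ query / np.sqrt(layer_vecs.shape[1])
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return weights @ layer_vecs, weights

# Hypothetical sizes: layers, frames, feature dim, clusters, acoustic features.
n_layers, n_frames, dim, n_clusters, n_acoustic = 3, 50, 16, 8, 88

layer_vecs = []
for _ in range(n_layers):
    frames = rng.normal(size=(n_frames, dim))         # stand-in for one layer's hidden states
    centroids = rng.normal(size=(n_clusters, dim))    # stand-in for fitted k-means centroids
    embed_table = rng.normal(size=(n_clusters, dim))  # stand-in for a learned token embedding
    tokens = quantize(frames, centroids)              # discrete token ids, shape (n_frames,)
    layer_vecs.append(embed_table[tokens].mean(axis=0))  # mean-pool token embeddings
layer_vecs = np.stack(layer_vecs)                     # shape (n_layers, dim)

query = rng.normal(size=dim)             # stand-in for a learned attention query
fused, weights = attention_fuse(layer_vecs, query)

acoustic = rng.normal(size=n_acoustic)   # stand-in for an openSMILE feature vector
utterance_vec = np.concatenate([fused, acoustic])  # input to a downstream SER classifier
```

In a trained system the centroids would come from k-means fitted on real hidden states, and the embedding table and attention query would be learned jointly with the classifier; the sketch only shows how the pieces connect.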

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Phonetics and Phonology Research