HYFuse: Aligning Heterogeneous Speech Pre-Trained Representations in Hyperbolic Space for Speech Emotion Recognition

Orchid Chetia Phukan; Girish; Mohd Mujtaba Akhtar; Swarup Ranjan Behera; Pailla Balakrishna Reddy; Arun Balaji Buduru; Rajesh Sharma

arXiv:2506.03403 · eess.AS · June 5, 2025 · Interspeech

TL;DR

HYFuse introduces a hyperbolic-space fusion method that combines neural audio codec features with pre-trained speech representations, improving speech emotion recognition over either representation alone.

Contribution

This paper presents HYFuse, the first framework to effectively fuse heterogeneous speech representations in hyperbolic space for SER.

Findings

01

HYFuse achieves state-of-the-art results in SER.

02

Fusion of RLRs and CBRs outperforms individual representations.

03

Hyperbolic space transformation enhances representation complementarity.

Abstract

Compression-based representations (CBRs) from neural audio codecs such as EnCodec capture intricate acoustic features like pitch and timbre, while representation-learning-based representations (RLRs) from pre-trained models trained for speech representation learning, such as WavLM, encode high-level semantic and prosodic information. Previous research on Speech Emotion Recognition (SER) has explored both; however, the fusion of CBRs and RLRs has not yet been explored. In this study, we address this gap by investigating the fusion of RLRs and CBRs, hypothesizing that the fused representations will be more effective because they provide complementary information. To this end, we propose HYFuse, a novel framework that fuses the representations by transforming them to hyperbolic space. With HYFuse, through fusion of x-vector (RLR) and SoundStream (CBR), we achieve the top performance in comparison to individual representations as…

Taxonomy

Topics: Emotion and Mood Recognition · Music and Audio Processing · Speech Recognition and Synthesis