HASRD: Hierarchical Acoustic and Semantic Representation Disentanglement

Amir Hussein; Sameer Khurana; Gordon Wichern; Francois G. Germain; Jonathan Le Roux

arXiv:2506.00843 · eess.AS · June 3, 2025

Open Access

TL;DR

HASRD introduces a hierarchical framework that disentangles semantic and acoustic representations in speech, improving ASR accuracy and reconstruction quality while reducing bitrate.

Contribution

The paper presents a hierarchical disentanglement method that separates speech representations into semantic tokens and acoustic tokens, so a single tokenizer can serve both recognition and reconstruction.

Findings

01. 44% relative WER improvement over SpeechTokenizer

02. Achieves high-quality reconstruction with lower bitrate

03. Enhances encoder efficiency without losing performance
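
For readers unfamiliar with the metric in the first finding, a relative WER improvement is the reduction in word error rate divided by the baseline's word error rate. The sketch below shows only that arithmetic. The function name and the two WER values are hypothetical, chosen so the result comes out near 44%; they are not figures reported in the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (0.44 means 44%)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values (in percent), NOT taken from the paper.
baseline_wer = 10.0
new_wer = 5.6

improvement = relative_wer_improvement(baseline_wer, new_wer)
print(f"relative WER improvement: {improvement:.0%}")
```

A relative figure like this says nothing about the absolute error rates, so it is best read alongside the WER values reported in the paper itself.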

Abstract

Effective speech representations for spoken language models must balance semantic relevance with acoustic fidelity for high-quality reconstruction. However, existing approaches struggle to achieve both simultaneously. To address this, we introduce Hierarchical Acoustic and Semantic Representation Disentanglement (HASRD, pronounced 'hazard'), a framework that factorizes self-supervised learning representations into discrete semantic and acoustic tokens. HASRD assigns the semantic representation to the first codebook, while encoding acoustic residuals in subsequent codebooks. This preserves ASR performance while achieving high-quality reconstruction. Additionally, we enhance HASRD's encoder efficiency, improving ASR performance without compromising reconstruction quality. Compared to SpeechTokenizer, HASRD achieves a 44% relative WER improvement, superior reconstruction quality, and 2x…
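
The "first codebook, then residuals in subsequent codebooks" layout described in the abstract has the general shape of residual vector quantization. The sketch below illustrates only that generic structure, under stated assumptions: the codebooks are random rather than learned, the dimensions and codebook sizes are arbitrary, and all function names are hypothetical. It does not reproduce HASRD's training, its encoder, or how the first codebook is made to carry semantic content.

```python
import random


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes what earlier stages left over."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry so the next codebook sees only the residual.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Assumed toy setup: random codebooks, not trained ones.
rng = random.Random(0)
DIM, NUM_CODEBOOKS, CODEBOOK_SIZE = 4, 3, 8
codebooks = [
    [[rng.gauss(0.0, 1.0 / (2 ** level)) for _ in range(DIM)]
     for _ in range(CODEBOOK_SIZE)]
    for level in range(NUM_CODEBOOKS)
]

frame = [rng.gauss(0.0, 1.0) for _ in range(DIM)]  # stand-in for one feature frame
indices, residual = rvq_encode(frame, codebooks)
reconstruction = rvq_decode(indices, codebooks)

# In the abstract's terms, indices[0] would play the role of the semantic token
# and indices[1:] the acoustic residual tokens.
print("token indices:", indices)
```

In this layout a downstream model that needs only the first-level token can ignore the later indices, while a decoder that needs full fidelity sums the entries from every level.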

Taxonomy

Topics: Speech Recognition and Synthesis