Factorized RVQ-GAN For Disentangled Speech Tokenization

Sameer Khurana; Dominik Klement; Antoine Laurent; Dominik Bobos; Juraj Novosad; Peter Gazdik; Ellen Zhang; Zili Huang; Amir Hussein; Ricard Marxer; Yoshiki Masuyama; Ryo Aihara; Chiori Hori; Francois G. Germain; Gordon Wichern; Jonathan Le Roux

arXiv:2506.15456·eess.AS·June 19, 2025

Open Access

TL;DR

This paper introduces HAC, a neural speech codec that factorizes its bottleneck into acoustic, phonetic, and lexical levels, enabling disentangled and interpretable speech representations for improved speech processing tasks.

Contribution

HAC is a hierarchical speech codec that uses knowledge distillation from two pre-trained models, HuBERT and LaBSE, to produce disentangled, multi-level speech tokens within a single unified model.

Findings

01

HAC tokens align with phonemes and words, capturing linguistic structure.

02

HAC outperforms baselines in disentanglement and reconstruction quality.

03

HAC preserves naturalness and interpretability of speech representations.

Abstract

We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels (acoustic, phonetic, and lexical) within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation, bridging acoustic detail and lexical…
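The two distillation objectives mentioned in the abstract can be pictured as extra loss terms added to the codec's reconstruction loss. The sketch below is illustrative only: the abstract does not state the loss form, so the cosine-distance objective, the function names, and the weights `w_phon` and `w_lex` are all assumptions, and the teacher vectors stand in for HuBERT and LaBSE embeddings that would be computed elsewhere.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Treat a zero vector as maximally dissimilar instead of dividing by zero.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def distillation_loss(student_seq, teacher_seq):
    """Mean cosine distance over aligned (student, teacher) vector pairs.

    Assumption: both sequences were already aligned to the same length.
    """
    if len(student_seq) != len(teacher_seq):
        raise ValueError("student and teacher sequences must have equal length")
    distances = [cosine_distance(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(distances) / len(distances)


def total_loss(recon_loss, phonetic_student, hubert_teacher,
               lexical_student, labse_teacher, w_phon=1.0, w_lex=1.0):
    """Hypothetical combined objective: reconstruction plus two distillation terms."""
    phon_term = distillation_loss(phonetic_student, hubert_teacher)
    lex_term = distillation_loss(lexical_student, labse_teacher)
    return recon_loss + w_phon * phon_term + w_lex * lex_term


# Toy usage: the phonetic level matches its teacher, the lexical level does not.
loss = total_loss(
    recon_loss=0.5,
    phonetic_student=[[1.0, 0.0]], hubert_teacher=[[1.0, 0.0]],
    lexical_student=[[1.0, 0.0]], labse_teacher=[[0.0, 1.0]],
)
```

In this toy call the phonetic term contributes nothing and the lexical term contributes its full weight, which is the behaviour one would want from per-level supervision: each bottleneck level is pulled toward a different teacher.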

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders

Methods: Knowledge Distillation