TL;DR
TaDiCodec introduces a novel, end-to-end diffusion-based speech tokenizer that integrates text guidance, achieving high compression efficiency and superior speech reconstruction quality without auxiliary models.
Contribution
The paper presents the first single-stage, end-to-end diffusion autoencoder for speech tokenization with text guidance, eliminating the need for two-stage training or auxiliary pre-trained models.
Findings
Achieves a frame rate of 6.25 Hz and a bitrate of 0.0875 kbps with a single-layer codebook for 24 kHz speech.
Maintains strong results on word error rate (WER), speaker similarity (SIM), and predicted speech quality (UTMOS).
Demonstrates effectiveness in zero-shot text-to-speech with both autoregressive and masked generative models.
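
The frame rate and bitrate above can be cross-checked with a little arithmetic. The sketch below assumes the usual relation for a single-codebook tokenizer, bitrate = frame rate × log2(codebook size), with one token emitted per frame. The function names are illustrative and do not come from the paper, and the codebook size it prints is an inference from the two reported numbers rather than a figure stated in the text here.

```python
import math

def bits_per_token(bitrate_kbps: float, frame_rate_hz: float) -> float:
    """Bits carried by each token, assuming one token per frame."""
    return bitrate_kbps * 1000.0 / frame_rate_hz

def implied_codebook_size(bits: float) -> int:
    """Codebook size implied by a per-token bit budget (single codebook assumed)."""
    return round(2 ** bits)

def samples_per_token(sample_rate_hz: float, frame_rate_hz: float) -> float:
    """Number of waveform samples summarized by one token."""
    return sample_rate_hz / frame_rate_hz

# Values reported in the summary: 0.0875 kbps, 6.25 Hz, 24 kHz audio.
bits = bits_per_token(0.0875, 6.25)
print(f"bits per token:        {bits:.2f}")
print(f"implied codebook size: {implied_codebook_size(bits)}")
print(f"samples per token:     {samples_per_token(24000, 6.25):.0f}")
```

Under these assumptions each token carries about 14 bits and stands in for 3840 waveform samples, which gives a sense of how aggressive the compression is.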
Abstract
Speech tokenizers serve as foundational components for speech language models, yet current designs exhibit several limitations, including: 1) dependence on multi-layer residual vector quantization structures or high frame rates, 2) reliance on auxiliary pre-trained models for semantic distillation, and 3) requirements for complex two-stage training processes. In this work, we introduce the Text-aware Diffusion Transformer Speech Codec (TaDiCodec), a novel approach designed to overcome these challenges. TaDiCodec employs end-to-end optimization for quantization and reconstruction through a diffusion autoencoder, while integrating text guidance into the diffusion decoder to enhance reconstruction quality and achieve optimal compression. TaDiCodec achieves an extremely low frame rate of 6.25 Hz and a corresponding bitrate of 0.0875 kbps with a single-layer codebook for 24 kHz speech, while…
