Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization
Luca Della Libera, Cem Subakan, Mirco Ravanelli

TL;DR
DyCAST is a dynamic, character-aligned speech tokenizer that uses variable frame rates and explicit duration modeling. It reaches competitive resynthesis quality with significantly fewer tokens than fixed-frame codecs, and its retrieval-augmented decoding improves reconstruction fidelity without increasing bitrate.
Contribution
The paper proposes DyCAST, a dynamic, character-aligned speech tokenizer with variable frame rates and explicit duration modeling, aimed at improving tokenization efficiency and synthesis quality.
Findings
Achieves competitive speech resynthesis quality.
Uses significantly fewer tokens than fixed-frame codecs.
Improves reconstruction fidelity with retrieval-augmented decoding.
Abstract
Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete tokens that can be processed by LLMs. However, existing codecs typically operate at fixed frame rates, allocating tokens uniformly in time and producing unnecessarily long sequences. In this work, we introduce DyCAST, a Dynamic Character-Aligned Speech Tokenizer that enables variable-frame-rate tokenization through soft character-level alignment and explicit duration modeling. DyCAST learns to associate tokens with character-level linguistic units during training and supports alignment-free inference with direct control over token durations at decoding time. To improve speech resynthesis quality at low frame rates, we further introduce a retrieval-augmented decoding mechanism that enhances reconstruction fidelity without increasing bitrate.…
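To make the contrast between fixed-frame and character-aligned tokenization concrete, here is a minimal Python sketch. It is not DyCAST's implementation: the `Token` class, the function names, the 20-frame utterance, and the hand-picked durations are all hypothetical, and the soft alignment and duration modeling that DyCAST learns are replaced by hard, manually assigned durations. It only illustrates why one token per character plus an explicit duration yields a shorter sequence than one token per frame.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """Hypothetical variable-rate token: one unit plus how long it lasts."""
    char: str
    duration_frames: int


def fixed_rate_token_count(n_frames: int) -> int:
    # A fixed-frame codec emits one token per frame, regardless of content.
    return n_frames


def character_aligned_tokens(chars: List[str], durations: List[int]) -> List[Token]:
    # One token per character, each carrying an explicit duration.
    # The durations here are assumed inputs, not predictions of a model.
    if len(chars) != len(durations):
        raise ValueError("chars and durations must have the same length")
    if any(d <= 0 for d in durations):
        raise ValueError("durations must be positive")
    return [Token(c, d) for c, d in zip(chars, durations)]


def expand_to_frames(tokens: List[Token]) -> List[str]:
    # Decoding side: repeat each token for its duration to recover
    # a frame-level sequence of the original length.
    frames: List[str] = []
    for tok in tokens:
        frames.extend([tok.char] * tok.duration_frames)
    return frames


# Illustrative values only: a 20-frame utterance of the word "hello".
chars = list("hello")
durations = [3, 5, 2, 2, 8]
tokens = character_aligned_tokens(chars, durations)
frames = expand_to_frames(tokens)

print(len(tokens), "variable-rate tokens vs",
      fixed_rate_token_count(len(frames)), "fixed-rate tokens")
```

In this toy case the variable-rate sequence is a quarter of the length of the frame-level one. The real ratio depends on the speech, the frame rate, and the learned durations, none of which this sketch models.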
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
