Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization

Luca Della Libera; Cem Subakan; Mirco Ravanelli

arXiv:2601.23174 · cs.LG · February 5, 2026

Open Access

TL;DR

DyCAST is a character-aligned speech tokenizer that operates at a variable frame rate and models token durations explicitly. It uses fewer tokens than fixed-frame codecs while keeping resynthesis quality competitive, and a retrieval-augmented decoding step improves reconstruction fidelity.

Contribution

The paper proposes DyCAST, a dynamic, character-aligned speech tokenizer that combines variable-frame-rate tokenization with explicit duration modeling to shorten token sequences while preserving synthesis quality.

Findings

01. Achieves competitive speech resynthesis quality.

02. Uses significantly fewer tokens than fixed-frame codecs.

03. Improves reconstruction fidelity with retrieval-augmented decoding.

Abstract

Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete tokens that can be processed by LLMs. However, existing codecs typically operate at fixed frame rates, allocating tokens uniformly in time and producing unnecessarily long sequences. In this work, we introduce DyCAST, a Dynamic Character-Aligned Speech Tokenizer that enables variable-frame-rate tokenization through soft character-level alignment and explicit duration modeling. DyCAST learns to associate tokens with character-level linguistic units during training and supports alignment-free inference with direct control over token durations at decoding time. To improve speech resynthesis quality at low frame rates, we further introduce a retrieval-augmented decoding mechanism that enhances reconstruction fidelity without increasing bitrate.…
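
The explicit duration modeling described in the abstract can be pictured with a minimal sketch. This is not DyCAST's implementation: the function names, token IDs, durations, and the 50 Hz reference rate below are all hypothetical, chosen only to show how a short variable-rate token sequence, paired with per-token durations, could be expanded back to a frame-level sequence at decoding time.

```python
from typing import List, Sequence


def expand_by_duration(tokens: Sequence[int], durations: Sequence[int]) -> List[int]:
    """Repeat each token for its duration (in frames) to get a frame-level sequence."""
    if len(tokens) != len(durations):
        raise ValueError("tokens and durations must have the same length")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    frames: List[int] = []
    for token, duration in zip(tokens, durations):
        frames.extend([token] * duration)
    return frames


def token_rate(num_tokens: int, seconds: float) -> float:
    """Average number of tokens per second for an utterance."""
    return num_tokens / seconds


# Hypothetical example: 5 character-aligned tokens covering 1 second of speech.
# A fixed-rate codec at an assumed 50 Hz would emit 50 tokens for the same span.
tokens = [17, 4, 92, 4, 33]
durations = [8, 12, 15, 5, 10]  # frames per token (assumed values, summing to 50)
frames = expand_by_duration(tokens, durations)
```

In this toy case five tokens stand in for 50 frames, so the token sequence is ten times shorter than the frame sequence. The ratio in practice depends on the speech content and on the codec used for comparison.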

Code & Models

Models

🤗 lucadellalib/dycast (model · 86 downloads · 3 likes)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research