CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate

Hankun Wang; Yiwei Guo; Chongtian Shao; Bohan Li; Kai Yu

arXiv:2506.21074 · eess.AS · February 4, 2026

Open Access

TL;DR

CodecSlime adds dynamic frame rate (DFR) support to neural speech codecs. By adapting token allocation to the non-uniform temporal information density of speech, it reduces temporal redundancy and improves reconstruction quality.

Contribution

CodecSlime is the first method to support dynamic frame rate in neural speech codecs. It is unsupervised and architecture-agnostic, and it improves efficiency and quality over fixed-frame-rate codecs.

Findings

1. Up to 32% lower reconstruction WER than fixed-frame-rate (FFR) baselines.
2. Supports multiple frame rates with consistent quality improvements.
3. Enables flexible trade-offs between quality and bitrate.

Abstract

Neural speech codecs have been widely used in audio compression and various downstream tasks. Current mainstream codecs are fixed-frame-rate (FFR), which allocate the same number of tokens to every equal-duration slice. However, speech is inherently non-uniform in temporal information density. As a result, many tokens are wasted on steady-state segments like long vowels and silences. To address this mismatch, we present CodecSlime, a plugin-style method for compressing temporal redundancy through supporting dynamic frame rate (DFR) on neural speech codecs for the first time. Our method is unsupervised and architecture-agnostic, combining two key innovations, ScheDFR and Melt-and-Cool, for adapting inference and training, respectively. When integrated into a typical VQ-GAN codec backbone and operating at 40 Hz DFR ($\approx$ 600 bps), the reconstruction WER of CodecSlime is reduced by up…
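To make the token-budget idea in the abstract concrete, here is a minimal sketch in Python. It is a toy illustration, not the paper's ScheDFR or Melt-and-Cool: it uses a simple greedy merge of near-identical adjacent frames, scalar stand-ins for codec frames, and a hypothetical 80 Hz base frame rate. The function names, the tolerance parameter, and the example values are all assumptions made for illustration. The bits-per-frame figure also ignores any side information a real DFR codec might need to transmit frame durations.

```python
from typing import List, Tuple


def bits_per_frame(bitrate_bps: float, frame_rate_hz: float) -> float:
    """Average number of bits available per frame at a given bitrate."""
    return bitrate_bps / frame_rate_hz


def merge_steady_frames(frames: List[float], tol: float) -> List[Tuple[float, int]]:
    """Toy greedy merge (an assumption, not the paper's scheduler).

    Collapses each run of adjacent frames that stay within `tol` of the
    run's first frame into one (mean_value, duration_in_frames) segment.
    """
    segments: List[Tuple[float, int]] = []
    i = 0
    while i < len(frames):
        j = i + 1
        while j < len(frames) and abs(frames[j] - frames[i]) <= tol:
            j += 1
        run = frames[i:j]
        segments.append((sum(run) / len(run), len(run)))
        i = j
    return segments


def average_frame_rate(base_rate_hz: float, n_frames: int, n_segments: int) -> float:
    """Effective frame rate after merging n_frames into n_segments."""
    return base_rate_hz * n_segments / n_frames


# Hypothetical scalar "frames": a silence, two fast changes, a held vowel.
frames = [0.0, 0.0, 0.0, 0.9, 0.2, 0.7, 0.7, 0.7]
segments = merge_steady_frames(frames, tol=0.05)
durations = [d for _, d in segments]

print(durations)                                          # segment lengths in frames
print(average_frame_rate(80.0, len(frames), len(segments)))  # effective rate in Hz
print(bits_per_frame(600.0, 40.0))                        # bits per frame at ~600 bps, 40 Hz
```

In this toy input, the steady-state runs each collapse to a single segment while the rapidly changing frames keep their own, which is the kind of non-uniform allocation the abstract motivates. A fixed-frame-rate codec would instead spend one token slot on every frame regardless of content.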

Taxonomy

Topics: Speech Recognition and Synthesis