CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate
Hankun Wang, Yiwei Guo, Chongtian Shao, Bohan Li, Kai Yu

TL;DR
CodecSlime adds dynamic frame rate (DFR) support to neural speech codecs, reducing temporal redundancy and improving reconstruction quality by allocating tokens according to speech's non-uniform temporal information density.
Contribution
CodecSlime is the first method to support dynamic frame rate in neural speech codecs. It is unsupervised and architecture-agnostic, and it improves both efficiency and quality over fixed-frame-rate codecs.
Findings
Up to 32% lower WER than fixed-frame-rate (FFR) baselines.
Supports multiple frame rates with consistent quality improvements.
Enables flexible trade-offs between quality and bitrate.
Abstract
Neural speech codecs have been widely used in audio compression and various downstream tasks. Current mainstream codecs are fixed-frame-rate (FFR), which allocate the same number of tokens to every equal-duration slice. However, speech is inherently non-uniform in temporal information density. As a result, many tokens are wasted on steady-state segments like long vowels and silences. To address this mismatch, we present CodecSlime, a plugin-style method for compressing temporal redundancy by supporting dynamic frame rate (DFR) on neural speech codecs for the first time. Our method is unsupervised and architecture-agnostic, combining two key innovations, ScheDFR and Melt-and-Cool, for adapting inference and training, respectively. When integrated into a typical VQ-GAN codec backbone and operating at 40 Hz DFR (≈600 bps), the reconstruction WER of CodecSlime is reduced by up…
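The idea of spending fewer tokens on steady-state segments can be illustrated with a toy sketch. This is not ScheDFR or any part of the authors' method: it is a simple greedy merge over made-up one-dimensional "features", and the function names, the tolerance, the maximum segment length, and the 80 Hz base rate are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def merge_steady_frames(
    frames: Sequence[float], tol: float, max_len: int = 4
) -> List[List[int]]:
    """Greedily group consecutive frame indices into segments.

    A frame joins the current segment if it is within `tol` of that
    segment's first frame and the segment is shorter than `max_len`.
    (Hypothetical helper; a stand-in for a real DFR scheduler.)
    """
    segments: List[List[int]] = []
    for i, value in enumerate(frames):
        if segments:
            current = segments[-1]
            anchor = frames[current[0]]
            if abs(value - anchor) <= tol and len(current) < max_len:
                current.append(i)
                continue
        # Start a new segment when the frame differs or the cap is hit.
        segments.append([i])
    return segments


def average_frame_rate(
    num_segments: int, num_frames: int, base_rate_hz: float
) -> float:
    """Average token rate if each segment is encoded as one token."""
    return base_rate_hz * num_segments / num_frames


# Toy features at an assumed 80 Hz base rate:
# silence, a short transient, then a steady vowel.
frames = [0.0, 0.0, 0.0, 0.0, 0.9, 0.2, 0.5, 0.5, 0.5, 0.5]
segments = merge_steady_frames(frames, tol=0.05)
rate = average_frame_rate(len(segments), len(frames), base_rate_hz=80.0)

print(segments)  # [[0, 1, 2, 3], [4], [5], [6, 7, 8, 9]]
print(rate)      # 32.0
```

In this toy input the silent and steady stretches collapse to one token each while the transient frames keep their own tokens, so ten fixed-rate frames become four tokens. A fixed-frame-rate codec would spend the same budget on every slice regardless of content.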
Taxonomy
Topics: Speech Recognition and Synthesis
