Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding
Xin Zhang, Lin Li, Xiangni Lu, Jianquan Liu, Kong Aik Lee

TL;DR
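This paper introduces SimWhisper-Codec, a low-bitrate speech codec that takes a semantic-first approach built on a frozen, simplified Whisper encoder. It outperforms semantically supervised codecs such as Mimi Codec and SpeechTokenizer on both semantic and acoustic metrics, without requiring external supervision.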
Contribution
The paper presents a semantic-first approach that simplifies Whisper for high-fidelity acoustic reconstruction and outperforms semantically supervised codecs.
Findings
SimWhisper-Codec outperforms Mimi Codec and SpeechTokenizer in both semantic and acoustic metrics.
Targeted architectural simplification enhances Whisper's acoustic modeling capabilities.
The approach does not require external supervision, simplifying the training process.
Abstract
Speech codecs serve as bridges between continuous speech signals and large language models, yet face an inherent conflict between acoustic fidelity and semantic preservation. To mitigate this conflict, prevailing methods augment acoustic codecs with complex semantic supervision. We explore the opposite direction: a semantic-first approach that starts from a semantically capable model and adapts it for high-fidelity acoustic reconstruction. Through empirical analysis, we discover that targeted architectural simplification can unlock the acoustic modeling potential of Whisper, a text-aligned Automatic Speech Recognition (ASR) model. Based on this finding, we propose SimWhisper-Codec, a novel codec that balances semantic and acoustic preservation by leveraging a frozen, simplified Whisper encoder without requiring external supervision. Experimental results demonstrate that…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques
