Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding

Xin Zhang; Lin Li; Xiangni Lu; Jianquan Liu; Kong Aik Lee

arXiv:2510.20504 · cs.SD · January 28, 2026

TL;DR

This paper introduces SimWhisper-Codec, a low-bitrate speech codec that takes a semantic-first approach: it builds on a frozen, simplified Whisper encoder and, without external supervision, outperforms Mimi Codec and SpeechTokenizer on both semantic and acoustic metrics.

Contribution

The paper presents a semantic-first approach that simplifies Whisper, a text-aligned ASR model, for high-fidelity acoustic reconstruction and outperforms semantically supervised codecs.

Findings

01

SimWhisper-Codec outperforms Mimi Codec and SpeechTokenizer in both semantic and acoustic metrics.

02

Targeted architectural simplification enhances Whisper's acoustic modeling capabilities.

03

The approach does not require external supervision, simplifying the training process.

Abstract

Speech codecs serve as bridges between continuous speech signals and large language models, yet they face an inherent conflict between acoustic fidelity and semantic preservation. To mitigate this conflict, prevailing methods augment acoustic codecs with complex semantic supervision. We explore the opposite direction: a semantic-first approach that starts from a semantically capable model and adapts it for high-fidelity acoustic reconstruction. Through empirical analysis, we discover that targeted architectural simplification can unlock the acoustic modeling potential of Whisper, a text-aligned Automatic Speech Recognition (ASR) model. Based on this finding, we propose SimWhisper-Codec, a novel codec that balances semantic and acoustic preservation by leveraging a frozen, simplified Whisper encoder without requiring external supervision. Experimental results demonstrate that…
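
As a reading aid for the term "low-bitrate": for a codec that emits discrete tokens, the bitrate is usually computed as frames per second times tokens per frame times bits per token. The sketch below shows that arithmetic only. The function name and the example configuration are hypothetical and are not taken from the paper, whose actual frame rate and codebook settings are not given in this summary.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second for a token-based codec.

    Assumes each frame emits one token per codebook and that every
    token costs log2(codebook_size) bits (no entropy coding).
    """
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_token


# Hypothetical configuration for illustration only (not from the paper):
# 50 frames per second, a single codebook with 1024 entries.
example_bps = codec_bitrate_bps(frame_rate_hz=50, num_codebooks=1, codebook_size=1024)
print(f"{example_bps / 1000:.2f} kbps")
```

Under these assumed values each token costs 10 bits, so the stream runs at 500 bits per second. Halving the frame rate or the number of codebooks halves the bitrate, which is why these two settings are the usual levers in low-bitrate designs.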

Code & Models

Models

🤗 xxx123456/SimWhisper_Codec (model)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques