Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning
Junyi Wang, Chi Zhang, Jing Qian, Haifeng Luo, Hao Wang, Zengrui Jin, Chao Zhang

TL;DR
This paper introduces ClariCodec, a neural speech codec optimised for ultra-low-bitrate communication at 300 bps, which uses reinforcement learning to improve intelligibility as measured by word error rate (WER).
Contribution
It presents a novel RL-based fine-tuning method for neural speech codecs that significantly enhances intelligibility at extremely low bitrates without sacrificing perceptual quality.
Findings
ClariCodec achieves 4.64% WER at 300 bps without RL.
RL fine-tuning reduces WER to 3.55% on test-clean.
RL fine-tuning thus yields a relative WER reduction of about 23% (from 4.64% to 3.55%).
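The relative figure follows directly from the two WER values reported above. A minimal sketch of the arithmetic, where the function name `relative_wer_reduction` is illustrative and not taken from the paper:

```python
def relative_wer_reduction(wer_before, wer_after):
    """Return the WER reduction as a fraction of the starting WER."""
    if wer_before <= 0:
        raise ValueError("wer_before must be positive")
    return (wer_before - wer_after) / wer_before


# WER values as reported in the findings (percent, 300 bps).
reduction = relative_wer_reduction(4.64, 3.55)
print(f"{reduction:.1%}")  # prints 23.5%
```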
Abstract
In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to allocate bits to perceptual detail, leading to substantial degradation in word error rate (WER). This paper proposes ClariCodec, a neural speech codec operating at 300 bits per second (bps) that reformulates quantisation as a stochastic policy, enabling reinforcement learning (RL)-based optimisation of intelligibility. Specifically, the encoder is fine-tuned using WER-driven rewards while the acoustic reconstruction pipeline remains frozen. Even without RL, ClariCodec achieves 4.64% WER on the LibriSpeech test-clean set at 300 bps, already competitive with codecs operating at higher bitrates. Further RL…
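To make the idea of "quantisation as a stochastic policy" concrete, here is a toy sketch in plain Python. It is not the paper's implementation: the softmax over negative squared distances, the `temperature` parameter, the baseline, the REINFORCE-style surrogate loss, and all numeric values are assumptions chosen purely for illustration.

```python
import math
import random


def policy_probs(latent, codebook, temperature=1.0):
    """Assumed policy: softmax over negative squared distances to each codeword."""
    logits = [
        -sum((z - c) ** 2 for z, c in zip(latent, code)) / temperature
        for code in codebook
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_code(probs, rng):
    """Draw one codeword index from the categorical policy."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]  # toy codewords, hypothetical
latent = [0.9, 1.1]                              # toy encoder output, hypothetical

probs = policy_probs(latent, codebook)
index = sample_code(probs, rng)

# Hypothetical WER-driven reward: negative WER of the decoded utterance.
reward = -0.05
baseline = -0.08  # e.g. a running mean of past rewards (assumption)

# REINFORCE-style surrogate loss: minimising it raises the probability of
# sampled codes whose reward beats the baseline.
loss = -(reward - baseline) * math.log(probs[index])
```

In a real system the gradient of this loss would flow into the encoder that produces `latent`, while the decoder stays frozen as the abstract describes. The sketch only shows how sampling a code index exposes a log-probability that a reward can weight.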
