From Hallucination to Articulation: Language Model-Driven Losses for Ultra Low-Bitrate Neural Speech Coding

Jayeon Yi; Minje Kim

arXiv:2602.06213·eess.AS·February 9, 2026

TL;DR

This paper introduces language model-driven losses (LM losses) for neural speech coding at ultra-low bitrates. By leveraging pretrained speech-text models, the losses reduce phoneme hallucinations and improve the semantic fidelity of decoded speech.

Contribution

The paper proposes LM loss functions that outperform semantic distillation (SD) in low-bitrate speech codecs. The losses are built on a modified ASR model (Whisper) and on self-supervised speech representations (WavLM).

Findings

01. LM losses reduce phoneme hallucinations more effectively than SD objectives.

02. Decoded speech adheres more closely to the semantic content of the input while quality is preserved.

03. The losses are applicable to very-low-bitrate speech coding scenarios.

Abstract

"Phoneme Hallucinations (PH)" commonly occur in low-bitrate DNN-based codecs. They arise when the generative decoder attempts to synthesize plausible outputs from excessively compressed tokens that are missing some semantic information. In this work, we propose language model-driven losses (LM losses) and show that they may alleviate PHs better than a semantic distillation (SD) objective in very-low-bitrate settings. The proposed LM losses build upon language models pretrained to associate speech with text. When ground-truth transcripts are unavailable, we propose to modify a popular automatic speech recognition (ASR) model, Whisper, to compare the decoded utterance against the ASR-inferred transcriptions of the input speech. Otherwise, we propose to use the timed-text regularizer (TTR) to compare WavLM representations of the decoded utterance against BERT representations of the ground-truth transcriptions. We…
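The transcript-free case described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes the comparison is a token-level cross-entropy between an ASR model's predictions on the decoded speech and greedy pseudo-labels inferred from the input speech, and it replaces Whisper with hand-written logit tables. All names (`greedy_transcript`, `lm_loss`, the logit arrays) are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def greedy_transcript(logits_per_step):
    """Pseudo-labels: the most likely token id at each position.

    Stands in for an ASR-inferred transcription of the INPUT speech
    (assumption: greedy decoding, no beam search).
    """
    return [max(range(len(step)), key=step.__getitem__) for step in logits_per_step]


def lm_loss(decoded_logits, target_tokens):
    """Mean negative log-likelihood of the target tokens.

    `decoded_logits` stands in for the ASR model's per-token logits when it
    listens to the codec's DECODED speech (assumption: same length as targets).
    """
    if len(decoded_logits) != len(target_tokens):
        raise ValueError("logits and targets must have the same length")
    nll = 0.0
    for step, token in zip(decoded_logits, target_tokens):
        nll -= log_softmax(step)[token]
    return nll / len(target_tokens)


# Toy data: 3 token positions, vocabulary of 4 token ids (made up).
input_logits = [[4.0, 0.1, 0.2, 0.0], [0.0, 3.5, 0.1, 0.3], [0.2, 0.0, 0.1, 5.0]]

# A faithful reconstruction: the ASR model still hears the same tokens.
faithful_logits = [[3.0, 0.2, 0.1, 0.0], [0.1, 3.0, 0.2, 0.0], [0.0, 0.1, 0.2, 4.0]]

# A hallucinated reconstruction: the second token is heard as a different one.
hallucinated_logits = [[3.0, 0.2, 0.1, 0.0], [0.1, 0.2, 3.0, 0.0], [0.0, 0.1, 0.2, 4.0]]

pseudo_labels = greedy_transcript(input_logits)
loss_faithful = lm_loss(faithful_logits, pseudo_labels)
loss_hallucinated = lm_loss(hallucinated_logits, pseudo_labels)

print("pseudo-labels:", pseudo_labels)
print("loss (faithful):", round(loss_faithful, 4))
print("loss (hallucinated):", round(loss_hallucinated, 4))
```

In this toy setting the hallucinated reconstruction receives the larger loss, which is the kind of training signal a loss of this form would give a codec. A real setup would need a differentiable ASR model and a gradient path through the decoder, both omitted here.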

Taxonomy

Topics: Advanced Data Compression Techniques · Speech Recognition and Synthesis · Speech and Audio Processing