Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation

Zheng Wang; Xiaobin Rong; Hang Su; Tianyi Tan; Junnan Wu; Lichun Fan; Zhenbo Luo; Jian Luan; Jing Lu

arXiv:2605.08608 · eess.AS · May 12, 2026

TL;DR

This paper introduces L3-SE, a noise-invariant acoustic-semantic distillation framework that reduces linguistic hallucination in LM-based speech enhancement, especially under adverse noise conditions.

Contribution

The paper proposes a noise-invariant conditioning encoder, trained on noisy speech by jointly distilling clean-speech acoustic and semantic targets, to improve the linguistic consistency of LM-based speech enhancement.

Findings

01. Outperforms prior LM-based SE methods on linguistic metrics

02. Significant reduction in hallucination under low-SNR and reverberant conditions

03. Maintains competitive perceptual speech quality

Abstract

Language model (LM)-based speech enhancement (SE) can generate natural-sounding speech, but under severe noise it often suffers from unreliable conditioning, leading to perceptually plausible yet linguistically incorrect outputs. To address this issue, we propose L3-SE, a noise-invariant acoustic-semantic distillation framework for reducing linguistic hallucination in LM-based SE. The proposed method learns a noise-invariant conditioning encoder from noisy speech by jointly distilling two complementary clean-speech targets: an acoustic target for reconstruction fidelity and a semantic target for linguistic consistency. The resulting noise-invariant acoustic-semantic representations are used to condition a decoder-only autoregressive language model, which predicts clean acoustic tokens that are decoded into enhanced speech. To support high-quality generation, we further employ a…
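
The joint distillation described in the abstract can be illustrated with a small sketch. The abstract does not state the actual loss functions, weights, or feature shapes, so everything below is an assumption made for illustration: the acoustic target is matched with mean squared error, the semantic target with cosine distance, and the names `joint_distillation_loss`, `w_acoustic`, and `w_semantic` are hypothetical. The student features stand for encoder outputs computed from noisy speech; the teacher features stand for targets computed from the paired clean speech.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_distance(a, b, eps=1e-12):
    """1 - cosine similarity; eps guards against zero-norm vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + eps)


def joint_distillation_loss(student_acoustic, student_semantic,
                            teacher_acoustic, teacher_semantic,
                            w_acoustic=1.0, w_semantic=1.0):
    """Weighted sum of an acoustic and a semantic distillation term.

    Assumed form (not taken from the paper):
      loss = w_acoustic * MSE(student_acoustic, teacher_acoustic)
           + w_semantic * cosine_distance(student_semantic, teacher_semantic)
    Each argument is a list of per-frame feature vectors of equal length.
    """
    n_frames = len(student_acoustic)
    acoustic_term = sum(
        mse(s, t) for s, t in zip(student_acoustic, teacher_acoustic)
    ) / n_frames
    semantic_term = sum(
        cosine_distance(s, t)
        for s, t in zip(student_semantic, teacher_semantic)
    ) / n_frames
    return w_acoustic * acoustic_term + w_semantic * semantic_term


# Toy usage: one frame, two-dimensional features (hypothetical values).
noisy_acoustic = [[0.0, 0.0]]
clean_acoustic = [[1.0, 1.0]]
noisy_semantic = [[1.0, 0.0]]
clean_semantic = [[0.0, 1.0]]
loss = joint_distillation_loss(noisy_acoustic, noisy_semantic,
                               clean_acoustic, clean_semantic)
```

In this toy setup both terms pull the noisy-input features toward their clean-speech counterparts, which is the sense in which the encoder is meant to become noise-invariant; how the two terms are balanced in L3-SE is not specified in the abstract.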
