TL;DR
This paper introduces L3-SE, a noise-invariant acoustic-semantic distillation framework that reduces linguistic hallucination in LM-based speech enhancement, especially under adverse noise conditions.
Contribution
It proposes a novel noise-invariant conditioning encoder learned via joint distillation of acoustic and semantic targets, improving linguistic consistency in speech enhancement.
Findings
Outperforms prior LM-based SE methods on linguistic metrics
Significantly reduces linguistic hallucination under low-SNR and reverberant conditions
Maintains competitive perceptual speech quality
Abstract
Language model (LM)-based speech enhancement (SE) can generate natural-sounding speech, but under severe noise it often suffers from unreliable conditioning, leading to perceptually plausible yet linguistically incorrect outputs. To address this issue, we propose L3-SE, a noise-invariant acoustic-semantic distillation framework for reducing linguistic hallucination in LM-based SE. The proposed method learns a noise-invariant conditioning encoder from noisy speech by jointly distilling two complementary clean-speech targets: an acoustic target for reconstruction fidelity and a semantic target for linguistic consistency. The resulting noise-invariant acoustic-semantic representations are used to condition a decoder-only autoregressive language model, which predicts clean acoustic tokens that are decoded into enhanced speech. To support high-quality generation, we further employ a…
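The joint distillation described in the abstract can be pictured as a weighted sum of two losses, one per clean-speech target. The following is only a minimal sketch under stated assumptions, not the paper's objective: the abstract does not give the loss functions, so mean squared error for the acoustic target, cosine distance for the semantic target, and the names `joint_distillation_loss`, `w_acoustic`, and `w_semantic` are all hypothetical choices made for illustration.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def joint_distillation_loss(student_acoustic, teacher_acoustic,
                            student_semantic, teacher_semantic,
                            w_acoustic=1.0, w_semantic=1.0):
    """Weighted sum of an acoustic and a semantic distillation term.

    The student features come from an encoder that sees noisy speech;
    the teacher features come from clean speech. The loss choices and
    weights are illustrative assumptions, not taken from the paper.
    """
    acoustic_term = mse(student_acoustic, teacher_acoustic)
    semantic_term = cosine_distance(student_semantic, teacher_semantic)
    return w_acoustic * acoustic_term + w_semantic * semantic_term


# Toy usage: the student matches the clean targets exactly, so both terms vanish.
loss = joint_distillation_loss([0.5, 1.0], [0.5, 1.0], [1.0, 0.0], [1.0, 0.0])
```

In this sketch, lowering `w_semantic` would trade linguistic consistency for reconstruction fidelity, which is one way to read the abstract's description of the two targets as complementary.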
