Modeling strategies for speech enhancement in the latent space of a neural audio codec
Sofiene Kammoun, Xavier Alameda-Pineda, Simon Leglaive

TL;DR
This paper compares continuous and discrete neural audio codec representations as training targets for speech enhancement. It finds that continuous targets and non-autoregressive models are the more practical choices, and that adding encoder fine-tuning yields the strongest enhancement metrics.
Contribution
It provides a comprehensive comparison of continuous versus discrete speech representations and autoregressive versus non-autoregressive models for neural speech enhancement.
Findings
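Predicting continuous latent representations consistently outperforms predicting discrete tokens.
Autoregressive models achieve higher quality, but at the expense of intelligibility and efficiency, which makes non-autoregressive models more attractive in practice.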
Encoder fine-tuning improves enhancement metrics but affects codec reconstruction.
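The efficiency gap between the two model families can be illustrated with a toy sketch. This is not the paper's Conformer implementation: `toy_predict_all`, `toy_predict_next`, and `CallCounter` are hypothetical stand-ins introduced only to count how many sequential model calls each decoding style needs.

```python
from typing import Callable, List

Frame = List[float]
LatentSeq = List[Frame]


class CallCounter:
    """Wraps a function and counts how many times it is called."""

    def __init__(self, fn: Callable) -> None:
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)


def toy_predict_all(noisy: LatentSeq) -> LatentSeq:
    # Hypothetical stand-in for a network: halve every latent value.
    return [[0.5 * x for x in frame] for frame in noisy]


def toy_predict_next(noisy: LatentSeq, history: LatentSeq) -> Frame:
    # Hypothetical stand-in: predicts frame t = len(history).
    # A real autoregressive model would also condition on `history`.
    t = len(history)
    return [0.5 * x for x in noisy[t]]


def decode_non_autoregressive(noisy: LatentSeq, predict_all: Callable) -> LatentSeq:
    # A single forward pass produces every enhanced frame at once.
    return predict_all(noisy)


def decode_autoregressive(noisy: LatentSeq, predict_next: Callable) -> LatentSeq:
    # One model call per frame; each call sees the frames produced so far.
    enhanced: LatentSeq = []
    for _ in range(len(noisy)):
        enhanced.append(predict_next(noisy, enhanced))
    return enhanced


noisy_latents = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
nar_model = CallCounter(toy_predict_all)
ar_model = CallCounter(toy_predict_next)
nar_out = decode_non_autoregressive(noisy_latents, nar_model)
ar_out = decode_autoregressive(noisy_latents, ar_model)
print(nar_model.calls, ar_model.calls)
```

With three input frames, the non-autoregressive path makes one model call and the autoregressive path makes three. The sequential loop grows with the number of frames, which is one plausible reading of the efficiency cost mentioned in the findings.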
Abstract
Neural audio codecs (NACs) provide compact latent speech representations in the form of sequences of continuous vectors or discrete tokens. In this work, we investigate how these two types of speech representations compare when used as training targets for supervised speech enhancement. We consider both autoregressive and non-autoregressive speech enhancement models based on the Conformer architecture, as well as a simple baseline where the NAC encoder is simply fine-tuned for speech enhancement. Our experiments reveal three key findings: predicting continuous latent representations consistently outperforms discrete token prediction; autoregressive models achieve higher quality but at the expense of intelligibility and efficiency, making non-autoregressive models more attractive in practice; and adding encoder fine-tuning yields the strongest enhancement metrics overall, though at the…
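The difference between the two kinds of training target can be sketched as follows. The summary does not state which loss functions the authors use, so the mean squared error for continuous vectors and the cross-entropy for discrete tokens below are assumptions, and the single three-entry codebook is a hypothetical simplification (many codecs use several codebooks).

```python
import math
from typing import List

Frame = List[float]


def quantize(latents: List[Frame], codebook: List[Frame]) -> List[int]:
    """Map each continuous latent vector to the index of its nearest code."""
    tokens = []
    for frame in latents:
        dists = [sum((a - b) ** 2 for a, b in zip(frame, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def mse_loss(pred: List[Frame], target: List[Frame]) -> float:
    """Assumed regression loss for continuous targets."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def cross_entropy_loss(logits: List[Frame], target_tokens: List[int]) -> float:
    """Assumed classification loss for discrete targets (one logit per code)."""
    total = 0.0
    for frame_logits, token in zip(logits, target_tokens):
        m = max(frame_logits)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(x - m) for x in frame_logits))
        total += log_norm - frame_logits[token]
    return total / len(target_tokens)


# Hypothetical toy data: two frames of 2-dimensional clean latents.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
clean_latents = [[0.1, -0.1], [0.9, 1.2]]
clean_tokens = quantize(clean_latents, codebook)

# Continuous strategy: the model output is compared to the latent vectors.
predicted_latents = [[0.0, 0.0], [1.0, 1.0]]
continuous_loss = mse_loss(predicted_latents, clean_latents)

# Discrete strategy: the model outputs logits over the codebook entries.
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
discrete_loss = cross_entropy_loss(uniform_logits, clean_tokens)
```

In the continuous case the model regresses the codec's latent vectors directly. In the discrete case it classifies each frame into a codebook entry, so the target first passes through the quantization step.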
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis
