Efficient Speech Enhancement via Embeddings from Pre-trained Generative Audioencoders
Xingwei Sun, Heinrich Dinkel, Yadong Niu, Linzhang Wang, Junbo Zhang, Jian Luan

TL;DR
This paper presents a novel speech enhancement method that uses pre-trained generative audio encoders to produce higher quality speech from noisy inputs, outperforming existing models in both objective and subjective evaluations.
Contribution
The paper introduces a new speech enhancement approach leveraging pre-trained generative audio encoders and a vocoder, demonstrating improved performance and efficiency over discriminative models.
Findings
Outperforms models built on discriminative audioencoders in both speech enhancement and speaker fidelity.
Achieves higher perceptual quality in subjective listening tests.
An ablation study supports the parameter efficiency of the compact denoise encoder.
Abstract
Recent research has delved into speech enhancement (SE) approaches that leverage audio embeddings from pre-trained models, diverging from time-frequency masking or signal prediction techniques. This paper introduces an efficient and extensible SE method. Our approach involves initially extracting audio embeddings from noisy speech using a pre-trained audioencoder, which are then denoised by a compact encoder network. Subsequently, a vocoder synthesizes the clean speech from denoised embeddings. An ablation study substantiates the parameter efficiency of the denoise encoder with a pre-trained audioencoder and vocoder. Experimental results on both speech enhancement and speaker fidelity demonstrate that our generative audioencoder-based SE system outperforms models utilizing discriminative audioencoders. Furthermore, subjective listening tests validate that our proposed system surpasses…
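The three stages described in the abstract (embedding extraction, embedding denoising, vocoder synthesis) can be pictured as a simple composition of functions. The following is a minimal sketch of that data flow only, not the authors' implementation: the class name `EnhancementPipeline`, the toy stand-in functions, and the frame size are all hypothetical, and the learned networks are replaced by trivial placeholders so the example runs without any dependencies.

```python
from dataclasses import dataclass
from typing import Callable, List

Waveform = List[float]
Embeddings = List[List[float]]  # one embedding vector per frame


@dataclass
class EnhancementPipeline:
    """Hypothetical wrapper for the three-stage flow; names are illustrative."""
    encode: Callable[[Waveform], Embeddings]      # pre-trained audioencoder
    denoise: Callable[[Embeddings], Embeddings]   # compact denoise encoder
    synthesize: Callable[[Embeddings], Waveform]  # vocoder

    def enhance(self, noisy: Waveform) -> Waveform:
        noisy_emb = self.encode(noisy)        # 1. embeddings from noisy speech
        clean_emb = self.denoise(noisy_emb)   # 2. denoise in embedding space
        return self.synthesize(clean_emb)     # 3. waveform from denoised embeddings


FRAME = 4  # assumed toy frame length, chosen only for the example


def toy_encode(wave: Waveform) -> Embeddings:
    # Placeholder: split into non-overlapping frames and use each frame
    # as its own "embedding". Trailing samples that do not fill a frame are dropped.
    n_frames = len(wave) // FRAME
    return [wave[i * FRAME:(i + 1) * FRAME] for i in range(n_frames)]


def toy_denoise(emb: Embeddings) -> Embeddings:
    # Placeholder: halve every value. A real denoise encoder would be a
    # trained network mapping noisy embeddings to clean ones.
    return [[0.5 * v for v in frame] for frame in emb]


def toy_synthesize(emb: Embeddings) -> Waveform:
    # Placeholder: concatenate frames back into a single waveform.
    return [v for frame in emb for v in frame]


pipeline = EnhancementPipeline(toy_encode, toy_denoise, toy_synthesize)
noisy = [0.2, -0.4, 0.6, -0.8, 1.0, -1.0, 0.0, 0.5, 0.9]
enhanced = pipeline.enhance(noisy)
```

Keeping the three stages behind separate callables mirrors the extensibility claim in the abstract: in this sketch, a different audioencoder or vocoder can be swapped in without touching the denoising step.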
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
