Enhancing Low-Quality Voice Recordings Using Disentangled Channel Factor and Neural Waveform Model
Haoyu Li, Yang Ai, Junichi Yamagishi

TL;DR
This paper introduces a neural network system that enhances low-quality voice recordings by disentangling channel effects and synthesizing high-quality speech, outperforming existing methods.
Contribution
It proposes a novel encoder-decoder architecture with adversarial training to separate channel factors and improve speech quality in low-quality recordings.
Findings
Significantly improves speech quality over baseline systems
Effectively disentangles channel characteristics from audio
Generates professional-quality speech from low-quality inputs
Abstract
High-quality speech corpora are essential foundations for most speech applications. However, such speech data are expensive and limited since they are collected in professional recording environments. In this work, we propose an encoder-decoder neural network to automatically enhance low-quality recordings to professional high-quality recordings. To address channel variability, we first filter out the channel characteristics from the original input audio using the encoder network with adversarial training. Next, we disentangle the channel factor from a reference audio. Conditioned on this factor, an auto-regressive decoder is then used to predict the target-environment Mel spectrogram. Finally, we apply a neural vocoder to synthesize the speech waveform. Experimental results show that the proposed system can generate a professional high-quality speech waveform when setting high-quality…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
