End-to-End Binaural Speech Synthesis
Wen Chin Huang, Dejan Markovic, Alexander Richard, Israel Dejene Gebru, and Anjali Menon

TL;DR
This paper introduces an end-to-end binaural speech synthesis system that pairs a low-bitrate audio codec with a binaural decoder able to reconstruct environmental factors such as ambient noise and reverb. The model is a modified VQ-VAE trained with several objectives, including an adversarial loss.
Contribution
It presents a binaural synthesis model that reconstructs environmental factors such as ambient noise and reverb, and that matches ground-truth recordings more closely than previous methods on objective metrics and in a perceptual study.
Findings
System closely matches ground truth data
Adversarial loss improves environmental effect capture
Outperforms previous binaural synthesis methods
Abstract
In this work, we present an end-to-end binaural speech synthesis system that combines a low-bitrate audio codec with a powerful binaural decoder that is capable of accurate speech binauralization while faithfully reconstructing environmental factors like ambient noise or reverb. The network is a modified vector-quantized variational autoencoder, trained with several carefully designed objectives, including an adversarial loss. We evaluate the proposed system on an internal binaural dataset with objective metrics and a perceptual study. Results show that the proposed approach matches the ground truth data more closely than previous methods. In particular, we demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.
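To make the "vector-quantized" part of the abstract concrete, here is a minimal sketch of the generic quantization step used in VQ-VAE-style codecs: each encoder output frame is replaced by its nearest codebook entry, and only the entry's index needs to be transmitted. This is not the paper's implementation. The function names, the L2 distance, and the frame rate and codebook size in the usage note are illustrative assumptions, and the paper's modifications to the VQ-VAE are not shown.

```python
import numpy as np


def quantize(latents, codebook):
    """Replace each latent frame with its nearest codebook entry.

    latents:  array of shape (T, D), one D-dimensional vector per frame.
    codebook: array of shape (K, D), K candidate code vectors.
    Returns (indices, quantized) with shapes (T,) and (T, D).
    Assumption: plain squared-L2 nearest-neighbour lookup, as in a generic VQ-VAE.
    """
    latents = np.asarray(latents, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Pairwise squared distances between every frame and every code: (T, K).
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]


def bitrate_bps(frames_per_second, codebook_size):
    """Bits per second needed to send one codebook index per frame."""
    return frames_per_second * np.log2(codebook_size)


if __name__ == "__main__":
    # Toy values, chosen only for illustration.
    codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
    latents = np.array([[0.1, -0.1], [0.9, 1.2]])
    indices, quantized = quantize(latents, codebook)
    print(indices)
    print(quantized)
    # Hypothetical codec settings: 50 frames/s, 1024-entry codebook.
    print(bitrate_bps(50, 1024))
```

Because the decoder only receives code indices, the bitrate is governed by the frame rate and the codebook size rather than by the waveform sample rate, which is what allows a codec of this kind to operate at low bitrates.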
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
