End-to-End Binaural Speech Synthesis

Wen Chin Huang; Dejan Markovic; Alexander Richard; Israel Dejene Gebru; and Anjali Menon

arXiv:2207.03697·cs.SD·July 11, 2022

Open Access

TL;DR

This paper introduces an end-to-end binaural speech synthesis system that pairs a low-bitrate audio codec with a binaural decoder able to reconstruct environmental factors such as ambient noise and reverb. The model is a modified VQ-VAE trained with multiple objectives, including an adversarial loss.

Contribution

It presents a binaural synthesis model that reproduces environmental factors such as ambient noise and reverb, and that matches the ground truth more closely than previous methods when evaluated with objective metrics and a perceptual study.

Findings

01. System closely matches ground truth data

02. Adversarial loss improves environmental effect capture

03. Outperforms previous binaural synthesis methods

Abstract

In this work, we present an end-to-end binaural speech synthesis system that combines a low-bitrate audio codec with a powerful binaural decoder that is capable of accurate speech binauralization while faithfully reconstructing environmental factors like ambient noise or reverb. The network is a modified vector-quantized variational autoencoder, trained with several carefully designed objectives, including an adversarial loss. We evaluate the proposed system on an internal binaural dataset with objective metrics and a perceptual study. Results show that the proposed approach matches the ground truth data more closely than previous methods. In particular, we demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.
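The abstract describes the network as a modified vector-quantized variational autoencoder. As a rough illustration of the generic quantization step only (not the authors' modified architecture), the sketch below maps each encoder output vector to its nearest codebook entry, so that only the integer indices need to be transmitted, which is what makes a low-bitrate codec possible. All names here (`quantize`, `commitment_loss`, the toy `codebook` and `latents`) are hypothetical and chosen for illustration; the paper's actual codebook size, latent dimension, and loss weights are not given in this section.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[List[float]]]:
    """Replace each latent vector with its nearest codebook entry.

    Returns the chosen indices (what a codec would transmit) and the
    quantized vectors (what a decoder would consume).
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for z in latents:
        # Nearest-neighbour search over the codebook.
        idx = min(
            range(len(codebook)),
            key=lambda k: squared_distance(z, codebook[k]),
        )
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


def commitment_loss(
    latents: Sequence[Vector], quantized: Sequence[Vector]
) -> float:
    """Mean squared distance between latents and their codebook entries.

    In a standard VQ-VAE this term (with gradients stopped on one side)
    keeps encoder outputs close to the codebook; here it is only computed
    as a plain number, with no automatic differentiation.
    """
    total = sum(squared_distance(z, q) for z, q in zip(latents, quantized))
    return total / len(latents)


# Toy data (assumed values, for illustration only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]]

indices, quantized = quantize(latents, codebook)
print(indices)
print(commitment_loss(latents, quantized))
```

In a full system the decoder would receive the quantized vectors and be trained with further objectives, such as the adversarial loss mentioned in the abstract; those parts are deliberately left out of this sketch.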

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing