# Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks

**Authors:** Ryan Eloff, André Nortje, Benjamin van Niekerk, Avashna Govender, Leanne Nortje, Arnu Pretorius, Elan van Biljon, Ewald van der Westhuizen, Lisa van Staden, Herman Kamper

arXiv: 1904.07556 · 2019-07-01

## TL;DR

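This paper presents an unsupervised method that uses discrete latent-variable neural networks to discover acoustic units in unlabelled speech, then synthesises speech from those units in a target speaker's voice. Synthesis quality is competitive with the ZeroSpeech 2019 challenge baseline, and the approach is relevant to low-resource speech technology and to studies of phonetic category learning in infants.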

## Contribution

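It pairs an autoencoder that discretises its intermediate representation with a decoder conditioned on speaker identity, decoupling acoustic unit discovery from speaker modelling. Three discretisation methods (CatVAE, VQ-VAE and straight-through estimation) are compared; the final VQ-VAE model with an FFTNet vocoder achieves synthesis quality competitive with the challenge baseline.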

## Key findings

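- Decoupled speaker conditioning intrinsically improves the discrete acoustic representations.
- The final model (convolutional encoder, VQ-VAE discretisation, deconvolutional decoder, FFTNet vocoder) achieves synthesis quality competitive with the challenge baseline.
- The discretisation methods are compared at different compression levels on two languages.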
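
As a rough illustration of decoder speaker conditioning, the sketch below appends a per-speaker embedding to every frame of discrete unit vectors before decoding. Concatenation is an assumption made for illustration; this summary does not say how the paper injects speaker identity. The names `condition_on_speaker` and `speaker_table` are hypothetical.

```python
import numpy as np

def condition_on_speaker(z_q, speaker_id, speaker_table):
    """Append one speaker embedding to every frame of unit vectors.

    z_q:           (T, D) quantised unit vectors for T frames
    speaker_table: (num_speakers, S) embedding per known speaker
    Returns a (T, D + S) array to feed to a decoder.
    NOTE: concatenation is an illustrative assumption, not the paper's
    confirmed conditioning mechanism.
    """
    s = speaker_table[speaker_id]             # (S,)
    s_tiled = np.tile(s, (z_q.shape[0], 1))   # repeat over time: (T, S)
    return np.concatenate([z_q, s_tiled], axis=1)

# Toy data: 3 frames of 2-D unit vectors, 2 speakers with 2-D embeddings.
z_q = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
speaker_table = np.array([[0.5, -0.5], [-1.0, 2.0]])

# Same units, two different target speakers.
dec_in_a = condition_on_speaker(z_q, 0, speaker_table)
dec_in_b = condition_on_speaker(z_q, 1, speaker_table)
```

Only the speaker columns differ between `dec_in_a` and `dec_in_b`. Because the decoder is given speaker identity separately, the discrete units have less need to encode it.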

## Abstract

For our submission to the ZeroSpeech 2019 challenge, we apply discrete latent-variable neural networks to unlabelled speech and use the discovered units for speech synthesis. Unsupervised discrete subword modelling could be useful for studies of phonetic category learning in infants or in low-resource speech technology requiring symbolic input. We use an autoencoder (AE) architecture with intermediate discretisation. We decouple acoustic unit discovery from speaker modelling by conditioning the AE's decoder on the training speaker identity. At test time, unit discovery is performed on speech from an unseen speaker, followed by unit decoding conditioned on a known target speaker to obtain reconstructed filterbanks. This output is fed to a neural vocoder to synthesise speech in the target speaker's voice. For discretisation, categorical variational autoencoders (CatVAEs), vector-quantised VAEs (VQ-VAEs) and straight-through estimation are compared at different compression levels on two languages. Our final model uses convolutional encoding, VQ-VAE discretisation, deconvolutional decoding and an FFTNet vocoder. We show that decoupled speaker conditioning intrinsically improves discrete acoustic representations, yielding competitive synthesis quality compared to the challenge baseline.
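
The discretisation step of a VQ-VAE can be sketched as a nearest-neighbour lookup in a codebook. The example below is a minimal NumPy illustration of that lookup only. It is not the authors' implementation: the function name, shapes and toy values are assumptions, and training details such as codebook learning and gradient handling are omitted.

```python
import numpy as np

def vector_quantise(z_e, codebook):
    """Map each encoder frame to its nearest codebook vector.

    z_e:      (T, D) continuous encoder outputs for T frames
    codebook: (K, D) embedding vectors, one per discrete unit
    Returns (indices, z_q): unit indices of shape (T,) and the
    corresponding quantised vectors of shape (T, D).
    """
    # Squared Euclidean distance between every frame and every code: (T, K).
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # discrete acoustic unit per frame
    z_q = codebook[indices]          # vectors passed on to the decoder
    return indices, z_q

# Hypothetical toy values: 4 codes in 2-D, 3 encoder frames.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z_e = np.array([[0.1, -0.1], [0.9, 0.2], [0.8, 0.9]])

indices, z_q = vector_quantise(z_e, codebook)
print(indices.tolist())  # [0, 1, 3]
```

The index sequence serves as the discovered unit transcription, while `z_q` is what a speaker-conditioned decoder would consume to reconstruct filterbanks.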

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.07556/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/1904.07556/full.md

## References

45 references — full list in the complete paper: https://tomesphere.com/paper/1904.07556/full.md

---
Source: https://tomesphere.com/paper/1904.07556