# Learning Latent Representations for Speech Generation and Transformation

**Authors:** Wei-Ning Hsu, Yu Zhang, James Glass

arXiv: 1704.04222 · 2017-09-25

## TL;DR

This paper introduces a convolutional VAE model for unsupervised learning of speech representations, enabling manipulation of phonetic content and speaker identity without labeled data.

## Contribution

The paper applies a convolutional VAE to speech and derives latent space arithmetic operations that disentangle and modify speech attributes, such as phonetic content and speaker identity, without supervision.
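For context, a VAE is usually trained by maximizing a variational lower bound on the log-likelihood of the data. The standard form is shown here in conventional notation; the symbols are assumptions of this note rather than taken from the summary, and the paper's exact objective and parameterization may differ. Here $x$ stands for a speech segment, $z$ for its latent representation, $q_\phi$ for the encoder, and $p_\theta$ for the decoder.

$$
\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
$$

The first term rewards accurate reconstruction of the segment. The second keeps the approximate posterior close to the prior $p(z)$, which is commonly a standard Gaussian.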

## Key findings

- Models the generative process of natural speech with a convolutional VAE
- Enables modification of the phonetic content or the speaker identity of speech segments through latent space arithmetic
- Requires no parallel supervisory data
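The kind of latent space arithmetic behind these findings can be illustrated with a minimal sketch. It assumes, as a simplification not confirmed by this summary, that an attribute such as speaker identity is represented by the mean of the latent vectors sharing that attribute, and that modification adds the difference between two such means. The function names are hypothetical, the vectors are toy values, and the encoder and decoder of the VAE are omitted.

```python
from statistics import fmean


def mean_latent(latents):
    """Average a list of equal-length latent vectors, dimension by dimension."""
    return [fmean(dim) for dim in zip(*latents)]


def attribute_shift(source_latents, target_latents):
    """Vector pointing from the source attribute mean to the target attribute mean."""
    mu_src = mean_latent(source_latents)
    mu_tgt = mean_latent(target_latents)
    return [t - s for s, t in zip(mu_src, mu_tgt)]


def modify(z, shift):
    """Move one latent vector along the shift vector."""
    return [zi + si for zi, si in zip(z, shift)]


# Toy 2-D latents standing in for encoded segments of two speakers (assumed values).
speaker_a = [[0.0, 1.0], [2.0, 1.0]]
speaker_b = [[3.0, 0.0], [5.0, 2.0]]

shift = attribute_shift(speaker_a, speaker_b)
z_modified = modify([0.5, -1.0], shift)
print(shift, z_modified)
```

In a full pipeline, the modified latent vector would be passed through the decoder to synthesize the transformed segment. No paired recordings of the two speakers are needed to compute the shift, only segments grouped by attribute.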

## Abstract

An ability to model a generative process and learn a latent representation for speech in an unsupervised fashion will be crucial to process vast quantities of unlabelled speech data. Recently, deep probabilistic generative models such as Variational Autoencoders (VAEs) have achieved tremendous success in modeling natural images. In this paper, we apply a convolutional VAE to model the generative process of natural speech. We derive latent space arithmetic operations to disentangle learned latent representations. We demonstrate the capability of our model to modify the phonetic content or the speaker identity for speech segments using the derived operations, without the need for parallel supervisory data.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1704.04222/full.md

## Figures

14 figures with captions in the complete paper: https://tomesphere.com/paper/1704.04222/full.md

## References

22 references — full list in the complete paper: https://tomesphere.com/paper/1704.04222/full.md

---
Source: https://tomesphere.com/paper/1704.04222