# TTS Skins: Speaker Conversion via ASR

**Authors:** Adam Polyak, Lior Wolf, Yaniv Taigman

arXiv: 1904.08983 · 2020-07-28

## TL;DR

The paper introduces a fully convolutional wav-to-wav network for speaker conversion. It pairs an encoder pre-trained for Automatic Speech Recognition (ASR) with a multi-speaker autoregressive waveform decoder, so conversion does not rely on text.

## Contribution

The paper presents a fully convolutional wav-to-wav network for speaker conversion whose encoder is pre-trained for ASR, so conversion between voices needs no text transcriptions.
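The data flow behind this design can be illustrated with a toy sketch: a frozen encoder maps the source waveform to features, and a speaker-conditioned decoder regenerates samples one at a time, feeding each output back in. Everything here is hypothetical and chosen only for illustration (`FrozenEncoder`, `ARDecoder`, `convert`, the hop size, and the hand-set scalar weights). It is not the paper's network, which is convolutional and trained on audiobook data.

```python
import math
import random


class FrozenEncoder:
    """Toy stand-in for a pre-trained ASR encoder (assumption, not the paper's model)."""

    def __init__(self, hop=4):
        self.hop = hop  # samples per feature frame

    def __call__(self, wav):
        # One feature per non-overlapping frame: the mean absolute amplitude.
        return [
            sum(abs(x) for x in wav[i:i + self.hop]) / self.hop
            for i in range(0, len(wav) - self.hop + 1, self.hop)
        ]


class ARDecoder:
    """Toy multi-speaker decoder: each sample depends on the previous output."""

    def __init__(self, n_speakers, hop=4, seed=0):
        rng = random.Random(seed)
        self.hop = hop
        # One scalar "embedding" per speaker; a real model would learn vectors.
        self.speaker_emb = [rng.uniform(-1.0, 1.0) for _ in range(n_speakers)]
        # Hand-set weights for illustration only; nothing is trained here.
        self.w_prev, self.w_feat, self.w_spk = 0.5, 1.0, 0.3

    def generate(self, feats, speaker_id):
        out, prev = [], 0.0
        spk = self.speaker_emb[speaker_id]
        for f in feats:
            # Upsample: emit `hop` samples for each feature frame.
            for _ in range(self.hop):
                prev = math.tanh(
                    self.w_prev * prev + self.w_feat * f + self.w_spk * spk
                )
                out.append(prev)
        return out


def convert(wav, encoder, decoder, target_speaker):
    """Encode the source audio, then decode it with the target speaker's identity."""
    return decoder.generate(encoder(wav), target_speaker)


# Usage: the same source audio rendered under two different speaker identities.
source = [math.sin(0.3 * n) for n in range(32)]
encoder = FrozenEncoder(hop=4)
decoder = ARDecoder(n_speakers=2, hop=4)
as_speaker_0 = convert(source, encoder, decoder, target_speaker=0)
as_speaker_1 = convert(source, encoder, decoder, target_speaker=1)
```

The point of the sketch is the separation of roles: the encoder never sees a speaker identity, and only the decoder's conditioning changes between the two calls to `convert`.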

## Key findings

- Voice conversion demonstrated with a network trained on narrated audiobooks
- Multi-voice TTS in the audiobook narrators' voices, obtained by converting the voice of a TTS robot
- Shows potential for speaker adaptation in TTS systems

## Abstract

We present a fully convolutional wav-to-wav network for converting between speakers' voices, without relying on text. Our network is based on an encoder-decoder architecture, where the encoder is pre-trained for the task of Automatic Speech Recognition, and a multi-speaker waveform decoder is trained to reconstruct the original signal in an autoregressive manner. We train the network on narrated audiobooks, and demonstrate multi-voice TTS in those voices, by converting the voice of a TTS robot.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.08983/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/1904.08983/full.md

## References

39 references — full list in the complete paper: https://tomesphere.com/paper/1904.08983/full.md

---
Source: https://tomesphere.com/paper/1904.08983