# Speaker-Independent Speech-Driven Visual Speech Synthesis using Domain-Adapted Acoustic Models

**Authors:** Ahmed Hussen Abdelaziz, Barry-John Theobald, Justin Binder, Gabriele Fanelli, Paul Dixon, Nicholas Apostoloff, Thibaut Weise, Sachin Kajareker

arXiv: 1905.06860 · 2019-05-17

## TL;DR

This paper shows that an ASR acoustic model trained on ten thousand hours of audio-only data can be adapted to visual speech synthesis using ninety hours of synchronized audio-visual speech. In a subjective test, viewers significantly preferred the resulting lip animations over those from a randomly initialized network.

## Contribution

The study adapts an automatic speech recognition (ASR) acoustic model, pretrained on audio-only data, to speaker-independent visual speech synthesis. This reduces the reliance on synchronized audio, video, and depth data, which is scarce compared with audio-only speech.
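
The initialization idea can be illustrated with a minimal sketch. This is not the authors' architecture: the layer sizes, the choice to reuse all hidden layers, and every function name below are assumptions made for illustration, and the fine-tuning step on audio-visual data is omitted. It only shows how a classification network's hidden layers might be reused under a new regression head that outputs animation controls.

```python
import random

def init_layer(n_in, n_out, rng):
    """Return a randomly initialized dense layer as (weights, biases)."""
    scale = (1.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(layer, x, relu=True):
    """Apply one dense layer to a feature vector x."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out

def build_acoustic_model(n_feats, n_hidden, n_classes, rng):
    """Hypothetical acoustic model: two hidden layers plus a classification head."""
    return {
        "hidden": [init_layer(n_feats, n_hidden, rng), init_layer(n_hidden, n_hidden, rng)],
        "head": init_layer(n_hidden, n_classes, rng),
    }

def adapt_to_visual_speech(acoustic_model, n_controls, rng):
    """Reuse the hidden layers and swap the classifier for a regression head."""
    n_hidden = len(acoustic_model["hidden"][-1][0])
    return {
        # Copy the weights so later fine-tuning would not modify the source model.
        "hidden": [([row[:] for row in w], b[:]) for w, b in acoustic_model["hidden"]],
        # The new head is randomly initialized; it maps to animation controls.
        "head": init_layer(n_hidden, n_controls, rng),
    }

def forward(model, x):
    """Run the hidden layers with ReLU, then a linear output head."""
    for layer in model["hidden"]:
        x = dense(layer, x)
    return dense(model["head"], x, relu=False)

rng = random.Random(0)
# All sizes here are illustrative assumptions, not values from the paper.
am = build_acoustic_model(n_feats=40, n_hidden=16, n_classes=100, rng=rng)
synth = adapt_to_visual_speech(am, n_controls=8, rng=rng)
frame = [rng.random() for _ in range(40)]
controls = forward(synth, frame)
```

In this sketch the randomly initialized baseline would correspond to calling `build_acoustic_model` with fresh weights and never copying anything from a pretrained network.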

## Key findings

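- In a subjective assessment, viewers significantly preferred animations from the AM-initialized DNN over those from a randomly initialized DNN.
- Adapting an ASR acoustic model to the visual speech domain with ninety hours of synchronized audio-visual speech improves perceived synthesis quality.
- Pretraining on ten thousand hours of audio-only data yields speech representations that transfer to visual speech synthesis.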
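
This summary does not state which statistical test the authors applied to the preference data. As an illustration only, one common way to check whether a pairwise preference differs from chance is an exact two-sided binomial test. The function name and the vote counts below are hypothetical and are not results from the paper.

```python
from math import comb

def two_sided_binomial_p(successes, trials, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely
    than the observed one under the null preference rate p.
    """
    pmf = [comb(trials, k) * p**k * (1 - p) ** (trials - k) for k in range(trials + 1)]
    observed = pmf[successes]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))

# Hypothetical counts: 70 of 100 pairwise votes favour one system.
p_value = two_sided_binomial_p(70, 100)
significant = p_value < 0.05
```

A small p-value under this test would indicate that the observed preference split is unlikely if viewers had no real preference between the two systems.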

## Abstract

Speech-driven visual speech synthesis involves mapping features extracted from acoustic speech to the corresponding lip animation controls for a face model. This mapping can take many forms, but a powerful approach is to use deep neural networks (DNNs). However, a limitation is the lack of synchronized audio, video, and depth data required to reliably train the DNNs, especially for speaker-independent models. In this paper, we investigate adapting an automatic speech recognition (ASR) acoustic model (AM) for the visual speech synthesis problem. We train the AM on ten thousand hours of audio-only data. The AM is then adapted to the visual speech synthesis domain using ninety hours of synchronized audio-visual speech. Using a subjective assessment test, we compared the performance of the AM-initialized DNN to one with a random initialization. The results show that viewers significantly prefer animations generated from the AM-initialized DNN over the ones generated using the randomly initialized model. We conclude that visual speech synthesis can significantly benefit from the powerful representation of speech in the ASR acoustic models.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.06860/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/1905.06860/full.md

## References

37 references — full list in the complete paper: https://tomesphere.com/paper/1905.06860/full.md

---
Source: https://tomesphere.com/paper/1905.06860