# Speech2Face: Learning the Face Behind a Voice

**Authors:** Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Wojciech Matusik

arXiv: 1905.09773 · 2019-05-24

## TL;DR

This paper introduces Speech2Face, a deep neural network that reconstructs a facial image of a person from a short audio recording of that person speaking. It is trained on millions of Internet videos and captures physical attributes such as age, gender, and ethnicity without modeling them explicitly.

## Contribution

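The paper presents a self-supervised approach to inferring facial features from speech: the model learns voice-face correlations directly from the natural co-occurrence of faces and speech in Internet videos, so no manually labeled attributes are required.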
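The self-supervised idea can be illustrated with a toy sketch: because each clip supplies a voice and a face together, the face itself serves as the training target and no attribute labels are involved. Everything below is an assumption made for illustration: the data is synthetic, the feature dimensions and the hidden voice-to-face relation are invented, and a linear map fitted by stochastic gradient descent stands in for the deep network described in the paper. It is not the authors' architecture.

```python
import random


def make_pairs(n, seed=0):
    """Synthetic stand-in for (voice feature, face feature) pairs.

    Each pair mimics features extracted from the same video clip.
    The relation between the two is hypothetical, chosen only so
    that there is something to learn.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        voice = [rng.gauss(0, 1) for _ in range(3)]
        face = [
            2.0 * voice[0] - voice[1] + rng.gauss(0, 0.05),
            0.5 * voice[2] + rng.gauss(0, 0.05),
        ]
        pairs.append((voice, face))
    return pairs


def predict(weights, voice):
    """Apply the linear voice-to-face map."""
    return [sum(w * v for w, v in zip(row, voice)) for row in weights]


def train(pairs, lr=0.05, epochs=50):
    """Fit the map by SGD on squared error.

    The only supervision is the co-occurring face feature itself;
    no age, gender, or other attribute labels are used.
    """
    n_out, n_in = len(pairs[0][1]), len(pairs[0][0])
    weights = [[0.0] * n_in for _ in range(n_out)]
    for _ in range(epochs):
        for voice, face in pairs:
            pred = predict(weights, voice)
            for i in range(n_out):
                err = pred[i] - face[i]
                for j in range(n_in):
                    weights[i][j] -= lr * err * voice[j]
    return weights


def mean_squared_error(weights, pairs):
    """Average squared difference between predicted and true face features."""
    total, count = 0.0, 0
    for voice, face in pairs:
        pred = predict(weights, voice)
        total += sum((p - f) ** 2 for p, f in zip(pred, face))
        count += len(face)
    return total / count


if __name__ == "__main__":
    pairs = make_pairs(200)
    train_pairs, held_out = pairs[:150], pairs[150:]
    weights = train(train_pairs)
    print("held-out MSE:", mean_squared_error(weights, held_out))
```

Evaluating on held-out pairs mirrors, in a very reduced form, the question the paper asks: how closely does a prediction made from audio alone match the true face of the speaker?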

## Key findings

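- Faces reconstructed directly from audio resemble the speakers' true face images in their physical attributes, and the paper quantifies this resemblance numerically.
- The model captures attributes such as age, gender, and ethnicity from speech.
- Training requires no explicit attribute labels; it relies on the natural co-occurrence of faces and speech in videos.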

## Abstract

How much can we infer about a person's looks from the way they speak? In this paper, we study the task of reconstructing a facial image of a person from a short audio recording of that person speaking. We design and train a deep neural network to perform this task using millions of natural Internet/YouTube videos of people speaking. During training, our model learns voice-face correlations that allow it to produce images that capture various physical attributes of the speakers such as age, gender and ethnicity. This is done in a self-supervised manner, by utilizing the natural co-occurrence of faces and speech in Internet videos, without the need to model attributes explicitly. We evaluate and numerically quantify how--and in what manner--our Speech2Face reconstructions, obtained directly from audio, resemble the true face images of the speakers.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.09773/full.md

## Figures

18 figures with captions in the complete paper: https://tomesphere.com/paper/1905.09773/full.md

## References

54 references — full list in the complete paper: https://tomesphere.com/paper/1905.09773/full.md

---
Source: https://tomesphere.com/paper/1905.09773