Improved Speech Reconstruction from Silent Video

Ariel Ephrat; Tavi Halperin; Shmuel Peleg

arXiv:1708.01204 · cs.CV · August 31, 2017

TL;DR

This paper introduces an end-to-end CNN-based model that converts silent video of a speaking person into intelligible, natural-sounding speech. On objective quality measures it scores significantly higher than existing models, and it shows promising results towards reconstructing speech from an unconstrained dictionary.

Contribution

The paper presents an end-to-end CNN model for speech reconstruction from silent video, achieving improved quality and intelligibility over existing approaches.

Findings

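01. Reconstructed speech attains objective quality scores significantly higher than those of existing models.

02. The model is trained and evaluated on speakers from the GRID and TCD-TIMIT datasets.

03. Promising results towards reconstructing speech from an unconstrained dictionary.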

Abstract

Speechreading is the task of inferring phonetic information from visually observed articulatory facial movements, and is a notoriously difficult task for humans to perform. In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible and natural-sounding acoustic speech signal from silent video frames of a speaking person. We train our model on speakers from the GRID and TCD-TIMIT datasets, and evaluate the quality and intelligibility of reconstructed speech using common objective measurements. We show that speech predictions from the proposed model attain scores which indicate significantly improved quality over existing models. In addition, we show promising results towards reconstructing speech from an unconstrained dictionary.
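The abstract says reconstructed speech is scored with "common objective measurements" but does not name them. As a rough illustration of what such a measurement can look like, the sketch below computes a log-spectral distance between a reference waveform and a reconstruction. This metric is swapped in purely as an example and is not claimed to be one used in the paper; the function names, frame length, and hop size are all assumptions.

```python
import numpy as np


def frame_signal(x, frame_len=512, hop=256):
    """Slice a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def log_spectral_distance(reference, estimate, frame_len=512, hop=256, eps=1e-10):
    """Mean log-spectral distance (dB) between two equal-length waveforms.

    Illustrative stand-in for an objective quality score; lower means the
    estimate's short-time power spectrum is closer to the reference's.
    """
    window = np.hanning(frame_len)
    ref_frames = frame_signal(reference, frame_len, hop) * window
    est_frames = frame_signal(estimate, frame_len, hop) * window
    # Short-time power spectra of both signals.
    ref_power = np.abs(np.fft.rfft(ref_frames, axis=1)) ** 2
    est_power = np.abs(np.fft.rfft(est_frames, axis=1)) ** 2
    # Per-bin difference of log power, in dB.
    diff_db = 10.0 * np.log10(ref_power + eps) - 10.0 * np.log10(est_power + eps)
    # RMS over frequency bins per frame, then averaged over frames.
    return float(np.mean(np.sqrt(np.mean(diff_db ** 2, axis=1))))
```

Called on a ground-truth recording and a model output of equal length and sample rate, the function returns a single number, so different reconstruction models can be compared on the same test clips. Standardized perceptual measures would normally come from dedicated, validated implementations rather than a hand-written function like this one.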
