Improved Speech Reconstruction from Silent Video
Ariel Ephrat, Tavi Halperin, Shmuel Peleg

TL;DR
This paper introduces an end-to-end CNN-based model that converts silent video frames of a speaking person into intelligible, natural-sounding speech. It reports objective scores indicating significantly improved quality over existing models, along with promising results towards reconstructing speech from an unconstrained dictionary.
Contribution
The paper presents an end-to-end CNN model that generates an acoustic speech signal from silent video frames of a speaking person. Its predictions attain objective scores indicating significantly improved quality over existing models.
Findings
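Reconstructed speech attains objective scores indicating significantly improved quality over existing models.
The model is trained on speakers from the GRID and TCD-TIMIT datasets, and its output is evaluated for quality and intelligibility using common objective measurements.
Results towards reconstructing speech from an unconstrained dictionary are promising.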
Abstract
Speechreading is the task of inferring phonetic information from visually observed articulatory facial movements, and is a notoriously difficult task for humans to perform. In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible and natural-sounding acoustic speech signal from silent video frames of a speaking person. We train our model on speakers from the GRID and TCD-TIMIT datasets, and evaluate the quality and intelligibility of reconstructed speech using common objective measurements. We show that speech predictions from the proposed model attain scores which indicate significantly improved quality over existing models. In addition, we show promising results towards reconstructing speech from an unconstrained dictionary.
