Audio-Visual Speech Inpainting with Deep Learning
Giovanni Morrone, Daniel Michelsanti, Zheng-Hua Tan, Jesper Jensen

TL;DR
This paper introduces a deep learning framework for audio-visual speech inpainting that leverages visual cues to restore missing speech segments, outperforming audio-only methods especially for longer gaps.
Contribution
The paper presents a deep learning approach that combines audio and visual data for speech inpainting, and shows that learning phone recognition jointly with inpainting (multi-task learning) is effective.
Findings
Audio-visual inpainting outperforms audio-only methods for large gaps.
Multi-task learning with phone recognition improves inpainting quality.
Visual information significantly enhances speech restoration for gaps up to 1600 ms.
Abstract
In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses on audio-only methods and generally aims at inpainting music signals, whose structure differs markedly from that of speech. Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different duration. We also experiment with a multi-task learning approach where a phone recognition task is learned together with speech inpainting. Results show that the performance of audio-only speech inpainting approaches degrades rapidly when gaps get large, while the proposed audio-visual approach is able to plausibly restore missing information. In addition, we…
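The setup described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the 10 ms frame hop, the zero-filling of the gap, the L1 reconstruction term, the frame-wise cross-entropy phone term, and the names `make_gap_mask`, `apply_gap`, `multitask_loss`, and `phone_weight` are all assumptions made for illustration. The paper may use different features and losses.

```python
import numpy as np


def make_gap_mask(num_frames, gap_ms, hop_ms=10.0, start_frame=None, rng=None):
    """Boolean mask of shape (num_frames,), True for frames inside the gap.

    hop_ms is an assumed frame hop; the abstract only gives gap durations.
    """
    gap_frames = int(round(gap_ms / hop_ms))
    if gap_frames <= 0 or gap_frames >= num_frames:
        raise ValueError("gap must cover at least one frame and be shorter than the signal")
    if start_frame is None:
        if rng is None:
            rng = np.random.default_rng()
        # Pick a random gap position that fits inside the signal.
        start_frame = int(rng.integers(0, num_frames - gap_frames + 1))
    if start_frame < 0 or start_frame + gap_frames > num_frames:
        raise ValueError("gap does not fit inside the signal")
    mask = np.zeros(num_frames, dtype=bool)
    mask[start_frame:start_frame + gap_frames] = True
    return mask


def apply_gap(spectrogram, mask):
    """Zero out the masked frames of a (frames, bins) spectrogram (assumed corruption)."""
    corrupted = spectrogram.copy()
    corrupted[mask, :] = 0.0
    return corrupted


def multitask_loss(predicted, target, mask, phone_log_probs, phone_labels, phone_weight=0.1):
    """Hypothetical joint objective: inpainting term plus weighted phone term.

    The inpainting term is the mean absolute error over gap frames only.
    The phone term is a frame-wise cross-entropy (an assumption).
    """
    reconstruction = np.mean(np.abs(predicted[mask] - target[mask]))
    frame_index = np.arange(len(phone_labels))
    phone_ce = -np.mean(phone_log_probs[frame_index, phone_labels])
    return reconstruction + phone_weight * phone_ce


if __name__ == "__main__":
    clean = np.ones((300, 80))           # 3 s of audio at the assumed 10 ms hop
    for gap_ms in (100, 400, 1600):      # gap range taken from the abstract
        mask = make_gap_mask(clean.shape[0], gap_ms, start_frame=50)
        corrupted = apply_gap(clean, mask)
        print(gap_ms, "ms gap ->", int(mask.sum()), "masked frames")
```

In an audio-visual model, the corrupted spectrogram and the uncorrupted visual features would both be given to the network, and the reconstruction term would be evaluated on the gap frames as above.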
Taxonomy
Methods: Inpainting
