The Conversation: Deep Audio-Visual Speech Enhancement
Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

TL;DR
This paper introduces a deep audio-visual speech enhancement network capable of isolating individual speakers in multi-talker videos, even for unseen speakers and in unconstrained real-world environments.
Contribution
It presents a novel deep learning model that uses lip region videos to separate speakers' voices, predicting both magnitude and phase for effective speech enhancement.
Findings
Strong quantitative results on real-world data
Effective separation of unseen speakers
Robust performance in unconstrained environments
Abstract
Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos. Existing works in this area have focussed on trying to separate utterances from known speakers in controlled environments. In this paper, we propose a deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the phase of the target signal. The method is applicable to speakers unheard and unseen during training, and for unconstrained environments. We demonstrate strong quantitative and qualitative results, isolating extremely challenging real-world examples.
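The abstract states that the network predicts both the magnitude and the phase of the target signal. As a rough illustration of why both are needed, the sketch below combines an estimated magnitude and an estimated phase into a complex spectrogram. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how the predictions are parameterised, so the multiplicative magnitude mask, the additive phase residual, and the names `enhance`, `magnitude_mask`, and `phase_residual` are all hypothetical.

```python
import numpy as np


def combine_magnitude_and_phase(magnitude, phase):
    """Build a complex spectrogram from a magnitude and a phase (in radians)."""
    return magnitude * np.exp(1j * phase)


def enhance(mixture_spec, magnitude_mask, phase_residual):
    """Hypothetical enhancement step (assumed parameterisation, not from the paper).

    mixture_spec   : complex STFT of the noisy mixture, shape (freq, time)
    magnitude_mask : real-valued array, same shape, scales the mixture magnitude
    phase_residual : real-valued array, same shape, added to the mixture phase
    """
    mixture_magnitude = np.abs(mixture_spec)
    mixture_phase = np.angle(mixture_spec)

    # Assumption: the target magnitude is a masked version of the mixture magnitude.
    estimated_magnitude = magnitude_mask * mixture_magnitude
    # Assumption: the target phase is the mixture phase plus a predicted correction.
    estimated_phase = mixture_phase + phase_residual

    return combine_magnitude_and_phase(estimated_magnitude, estimated_phase)


# Toy input: a random complex "spectrogram" standing in for a real STFT.
rng = np.random.default_rng(0)
mixture = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))

# An all-ones mask with a zero phase residual leaves the mixture unchanged.
identity = enhance(mixture, np.ones((4, 5)), np.zeros((4, 5)))

# Halving the mask halves the magnitude and keeps the phase.
halved = enhance(mixture, np.full((4, 5), 0.5), np.zeros((4, 5)))
```

In a full pipeline the enhanced spectrogram would be passed through an inverse STFT to obtain a waveform. Reusing the mixture phase unchanged (a zero residual here) is the common shortcut that predicting the phase is meant to improve on.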
