The Conversation: Deep Audio-Visual Speech Enhancement

Triantafyllos Afouras; Joon Son Chung; Andrew Zisserman

arXiv:1804.04121 · cs.CV · June 20, 2018

TL;DR

This paper introduces a deep audio-visual speech enhancement network capable of isolating individual speakers in multi-talker videos, even for unseen speakers and in unconstrained real-world environments.

Contribution

The paper presents a deep learning model that separates a target speaker's voice from a multi-talker mixture using video of that speaker's lip region, predicting both the magnitude and the phase of the target signal.

Findings

01 Strong quantitative results on real-world data

02 Effective separation of unseen speakers

03 Robust performance in unconstrained environments

Abstract

Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos. Existing works in this area have focussed on trying to separate utterances from known speakers in controlled environments. In this paper, we propose a deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the phase of the target signal. The method is applicable to speakers unheard and unseen during training, and for unconstrained environments. We demonstrate strong quantitative and qualitative results, isolating extremely challenging real-world examples.
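To make the "magnitude and phase" point concrete, here is a toy sketch of why an enhancement system might want to estimate phase rather than reuse the phase of the noisy mixture. It is not the paper's network: it works on a single short frame with a naive DFT, it assumes the target magnitude is known perfectly, and every name in it (`dft`, `idft`, `recombine`, `rms_error`, the signal parameters) is hypothetical and chosen only for illustration.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def recombine(magnitude, phase):
    """Build a complex spectrum from separate magnitude and phase estimates."""
    return [m * cmath.exp(1j * p) for m, p in zip(magnitude, phase)]


def rms_error(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Assumed toy signals: the interferer overlaps the target in frequency,
# so the mixture's phase in that bin differs from the target's phase.
n = 64
target = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
interferer = [
    0.8 * math.sin(2 * math.pi * 5 * t / n + 1.0)
    + 0.5 * math.sin(2 * math.pi * 9 * t / n)
    for t in range(n)
]
mixture = [a + b for a, b in zip(target, interferer)]

target_spec = dft(target)
mixture_spec = dft(mixture)

target_mag = [abs(c) for c in target_spec]
target_phase = [cmath.phase(c) for c in target_spec]
mixture_phase = [cmath.phase(c) for c in mixture_spec]

# Same (ideal) magnitude in both cases; only the phase source changes.
with_mixture_phase = idft(recombine(target_mag, mixture_phase))
with_target_phase = idft(recombine(target_mag, target_phase))

err_mix = rms_error(target, with_mixture_phase)
err_tgt = rms_error(target, with_target_phase)
print(f"RMS error reusing mixture phase: {err_mix:.4f}")
print(f"RMS error with target phase:     {err_tgt:.2e}")
```

In this toy setting, even a perfect magnitude estimate leaves a clearly audible-scale error when the mixture phase is reused, while the correct phase reconstructs the frame up to floating-point error. A real system would operate on a short-time Fourier transform over many overlapping frames and would have to predict both quantities from the audio and video inputs.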
