Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation

Ariel Ephrat; Inbar Mosseri; Oran Lang; Tali Dekel; Kevin Wilson; Avinatan Hassidim; William T. Freeman; Michael Rubinstein

arXiv:1804.03619 · cs.SD · August 13, 2018 · 45 cites

Open Access · 4 Repos · 1 Dataset

TL;DR

This paper introduces a deep audio-visual model that leverages visual cues to improve speech separation in noisy environments, demonstrating advantages over audio-only and speaker-dependent methods across various real-world scenarios.

Contribution

The authors develop a speaker-independent audio-visual speech separation model trained on a new large dataset, AVSpeech, outperforming existing methods in mixed speech scenarios.

Findings

01. Outperforms state-of-the-art audio-only speech separation methods.
02. Effective in real-world noisy environments like bars and interviews.
03. Speaker-independent model generalizes well to any speaker.

Abstract

We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the…

Code & Models

Repositories

Datasets

bbrothers/avspeech-metadata
dataset · 34 dl

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation