Seeing Through Noise: Visually Driven Speaker Separation and Enhancement
Aviv Gabbay, Ariel Ephrat, Tavi Halperin, Shmuel Peleg

TL;DR
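This paper introduces an audio-visual approach that predicts a speaker's voice from facial motion in silent video frames, then applies that prediction as a filter on the noisy audio to isolate and enhance the voice. It improves on both the raw video-to-speech predictions and a well-known audio-only method.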
Contribution
The paper combines a video-to-speech neural network with audio filtering for speaker separation and enhancement, and it avoids training on sound mixtures.
Findings
Significant SDR improvements over the raw video-to-speech predictions on GRID and TCD-TIMIT.
Significant PESQ improvements as well, indicating better perceived speech quality.
Outperforms a well-known audio-only method on both metrics.
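For readers unfamiliar with the SDR metric named above, the sketch below computes a simplified signal-to-distortion ratio in decibels. It is an illustration only: the function name `sdr_db` is hypothetical, and the plain energy ratio used here is an assumption that is simpler than the BSS Eval style decomposition usually used in separation benchmarks. The summary does not say which SDR variant the paper computes.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    `reference` is the clean signal and `estimate` is the processed one,
    both given as equal-length sequences of samples. This is an assumed,
    simplified form, not necessarily the variant used in the paper.
    """
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal_energy == 0:
        raise ValueError("reference signal has zero energy")
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)
```

Higher values mean the estimate is closer to the clean reference, so an SDR improvement is the difference between the score after processing and the score before it.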
Abstract
Isolating the voice of a specific person while filtering out other voices or background noises is challenging when video is shot in noisy environments. We propose audio-visual methods to isolate the voice of a single speaker and eliminate unrelated sounds. First, face motions captured in the video are used to estimate the speaker's voice, by passing the silent video frames through a video-to-speech neural network-based model. Then the speech predictions are applied as a filter on the noisy input audio. This approach avoids using mixtures of sounds in the learning process, as the number of such possible mixtures is huge, and would inevitably bias the trained model. We evaluate our method on two audio-visual datasets, GRID and TCD-TIMIT, and show that our method attains significant SDR and PESQ improvements over the raw video-to-speech predictions, and a well-known audio-only method.
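The filtering step described in the abstract can be sketched as time-frequency masking. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that both the predicted speech and the noisy audio are available as magnitude spectrograms of the same shape (lists of frames, each a list of frequency bins), and that the filter is a binary mask. The function names, the threshold value, and the two-speaker comparison rule are all hypothetical.

```python
def binary_mask(predicted_mag, threshold):
    """Return a 0/1 mask that is 1 where the predicted magnitude exceeds `threshold`."""
    return [[1.0 if v > threshold else 0.0 for v in frame] for frame in predicted_mag]


def apply_mask(noisy_mag, mask):
    """Keep only the time-frequency bins of the noisy spectrogram selected by the mask."""
    return [
        [n * m for n, m in zip(noisy_frame, mask_frame)]
        for noisy_frame, mask_frame in zip(noisy_mag, mask)
    ]


def separate_two_speakers(mixture_mag, pred_a, pred_b):
    """Assign each bin of the mixture to the speaker whose prediction is larger.

    Assumed rule for illustration: ties go to speaker A.
    """
    out_a, out_b = [], []
    for mix_frame, a_frame, b_frame in zip(mixture_mag, pred_a, pred_b):
        out_a.append([m if a >= b else 0.0 for m, a, b in zip(mix_frame, a_frame, b_frame)])
        out_b.append([m if a < b else 0.0 for m, a, b in zip(mix_frame, a_frame, b_frame)])
    return out_a, out_b


# Toy example: 2 frames x 2 frequency bins.
predicted = [[0.9, 0.1], [0.2, 0.8]]   # stand-in for the video-to-speech output
noisy = [[1.0, 2.0], [3.0, 4.0]]       # stand-in for the noisy input audio
enhanced = apply_mask(noisy, binary_mask(predicted, threshold=0.5))
```

Because the mask is derived only from the video-based prediction, nothing in this step needs to have seen mixed audio during training, which matches the motivation given in the abstract.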
