TL;DR
This paper systematically reviews deep learning methods for audio-visual speech enhancement and separation. It covers features, fusion techniques, datasets, and evaluation methods, and highlights recent advances and applications in the field.
Contribution
It provides a comprehensive overview of deep learning-based audio-visual speech enhancement and separation, including techniques, datasets, and evaluation, filling a gap in the literature.
Findings
Deep learning-based methods achieve strong performance in audio-visual speech separation.
Fusing acoustic and visual features improves speech enhancement performance.
Datasets and evaluation methods are critical for benchmarking progress.
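As an illustration of what an evaluation method can look like in practice, the sketch below computes the scale-invariant signal-to-distortion ratio (SI-SDR), a metric commonly used for speech separation. This summary does not say which metrics the paper covers, so the choice of SI-SDR is an assumption, and the function name `si_sdr` is hypothetical.

```python
import math
from typing import Sequence


def si_sdr(estimate: Sequence[float], reference: Sequence[float]) -> float:
    """Scale-invariant SDR in dB between an estimated and a reference signal.

    Assumption: SI-SDR is used here only as a representative metric.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0:
        raise ValueError("reference signal must not be all zeros")

    # Project the estimate onto the reference to get the scaled target.
    alpha = sum(e * r for e, r in zip(estimate, reference)) / ref_energy
    target = [alpha * r for r in reference]

    # Everything in the estimate that is not the scaled target is distortion.
    target_energy = sum(t * t for t in target)
    noise_energy = sum((e - t) ** 2 for e, t in zip(estimate, target))

    if noise_energy == 0:
        return math.inf   # estimate is a scaled copy of the reference
    if target_energy == 0:
        return -math.inf  # estimate is orthogonal to the reference
    return 10 * math.log10(target_energy / noise_energy)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score unchanged: `si_sdr([1.0, 1.0], [1.0, 0.0])` and `si_sdr([3.0, 3.0], [1.0, 0.0])` give the same value.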
Abstract
Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information…
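To make the idea of fusing acoustic and visual information concrete, here is a minimal sketch of one simple strategy: aligning the two streams in time and concatenating their per-frame feature vectors. It is an illustration under stated assumptions, not a method attributed to the paper. The helper names `upsample_nearest` and `fuse_by_concatenation` and the toy feature values are hypothetical. The sketch also assumes the visual stream has fewer frames than the acoustic one, as is typical when video frame rates are lower than acoustic frame rates.

```python
from typing import List

Frame = List[float]


def upsample_nearest(frames: List[Frame], target_len: int) -> List[Frame]:
    """Repeat visual frames so that there is one per acoustic frame."""
    if not frames:
        raise ValueError("need at least one visual frame")
    n = len(frames)
    # Map each target index to the source frame covering the same time span.
    return [frames[min(i * n // target_len, n - 1)] for i in range(target_len)]


def fuse_by_concatenation(audio: List[Frame], video: List[Frame]) -> List[Frame]:
    """Align the two streams in time, then concatenate them frame by frame."""
    aligned = upsample_nearest(video, len(audio))
    return [a + v for a, v in zip(audio, aligned)]


# Toy input (hypothetical values): 4 acoustic frames with 2 features each,
# 2 visual frames with 1 feature each.
audio_feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
video_feats = [[1.0], [2.0]]
fused = fuse_by_concatenation(audio_feats, video_feats)
```

Each fused frame here has three values: two acoustic features followed by the visual feature of the temporally nearest video frame. In a deep learning system the fused vectors would typically be passed to further network layers.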
