An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, and Jesper Jensen

arXiv:2008.09586·eess.AS·March 16, 2021

TL;DR

This paper systematically reviews deep learning methods for audio-visual speech enhancement and separation, covering features, fusion techniques, datasets, and evaluation methods, highlighting recent advances and applications in the field.

Contribution

It provides a comprehensive overview of deep learning-based audio-visual speech enhancement and separation, including techniques, datasets, and evaluation, filling a gap in the literature.

Findings

01

Deep learning significantly improves audio-visual speech separation performance.

02

Fusing acoustic and visual features improves speech enhancement performance.

03

Various datasets and evaluation methods are critical for benchmarking progress.

Abstract

Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information…

Code & Models

Repositories

danmic/av-se
none · Official
