An Empirical Study of Visual Features for DNN based Audio-Visual Speech Enhancement in Multi-talker Environments
Shrishti Saha Shetu, Soumitro Chakrabarty, Emanuël A. P. Habets

TL;DR
This paper empirically compares various visual features for deep neural network-based audio-visual speech enhancement in multi-talker environments, highlighting trade-offs between performance and computational complexity.
Contribution
To the authors' knowledge, it is the first published study of which visual features best suit DNN-based AVSE, evaluating their impact on enhancement performance and their pre-processing requirements.
Findings
Embedding-based features perform best overall.
Optical flow and raw pixels are more suitable for low-resource systems.
Pre-processing complexity varies significantly among features.
Abstract
Audio-visual speech enhancement (AVSE) methods use both audio and visual features for speech enhancement, and visual features have been shown to be particularly effective in multi-speaker scenarios. In the majority of deep neural network (DNN) based AVSE methods, the audio and visual data are first processed separately using different sub-networks, and the learned features are then fused to utilize the information from both modalities. Various studies have examined suitable audio input features and network architectures; however, to the best of our knowledge, no published study has investigated which visual features are best suited for this specific task. In this work, we perform an empirical study of the most commonly used visual features for DNN based AVSE, the pre-processing requirements for each of these features, and investigate their…
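The two-stream layout described in the abstract (separate audio and visual sub-networks whose learned features are then fused) can be illustrated with a minimal NumPy sketch. This is not the authors' architecture. The layer sizes, random weights, concatenation-based fusion, and mask-estimation output are all assumptions chosen for illustration, and the function names (`dense_relu`, `init_params`, `fuse_and_mask`) are hypothetical. The visual features are assumed to be already time-aligned with the audio frames.

```python
import numpy as np


def dense_relu(x, w, b):
    """One fully connected layer with ReLU, applied independently to each time frame."""
    return np.maximum(x @ w + b, 0.0)


def init_params(f_a, f_v, hidden, seed=0):
    """Random toy weights (hypothetical sizes); a real system would learn these."""
    rng = np.random.default_rng(seed)
    return {
        "w_a": rng.standard_normal((f_a, hidden)) * 0.1,
        "b_a": np.zeros(hidden),
        "w_v": rng.standard_normal((f_v, hidden)) * 0.1,
        "b_v": np.zeros(hidden),
        "w_out": rng.standard_normal((2 * hidden, f_a)) * 0.1,
        "b_out": np.zeros(f_a),
    }


def fuse_and_mask(audio_feats, visual_feats, params):
    """Process each modality separately, fuse by concatenation, and predict a mask.

    audio_feats:  (T, F_a) array, e.g. noisy magnitude spectrogram frames
    visual_feats: (T, F_v) array, e.g. per-frame visual features (assumed aligned)
    """
    a = dense_relu(audio_feats, params["w_a"], params["b_a"])    # audio sub-network
    v = dense_relu(visual_feats, params["w_v"], params["b_v"])   # visual sub-network
    fused = np.concatenate([a, v], axis=1)                       # feature fusion
    logits = fused @ params["w_out"] + params["b_out"]
    return 1.0 / (1.0 + np.exp(-logits))                         # sigmoid mask in (0, 1)


# Toy data: 50 frames, 257 frequency bins, 128-dimensional visual features.
T, F_A, F_V = 50, 257, 128
rng = np.random.default_rng(1)
noisy_mag = np.abs(rng.standard_normal((T, F_A)))
visual = rng.standard_normal((T, F_V))

params = init_params(F_A, F_V, hidden=64)
mask = fuse_and_mask(noisy_mag, visual, params)
enhanced_mag = mask * noisy_mag  # the mask attenuates each time-frequency bin
```

In this sketch, swapping the visual feature type only changes the `visual` array and its dimension `F_V`, which is the kind of variation the study compares. The pre-processing cost of producing that array is not modelled here.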
