Audio-Visual Speech Enhancement and Separation by Utilizing Multi-Modal Self-Supervised Embeddings

I-Chun Chern; Kuo-Hsuan Hung; Yi-Ting Chen; Tassadaq Hussain; Mandar Gogate; Amir Hussain; Yu Tsao; Jen-Cheng Hou

arXiv:2210.17456·eess.AS·June 2, 2023·1 cites

TL;DR

This paper demonstrates that multi-modal self-supervised embeddings from AV-HuBERT can be used effectively for audio-visual speech enhancement and separation, outperforming state-of-the-art AVSE models and traditional audio-only SE models.

Contribution

The paper introduces an approach that pairs the pre-trained AV-HuBERT model with a speech enhancement (SE) module for AVSE and AVSS, and reports improved performance over state-of-the-art methods.

Findings

01. The proposed model outperforms state-of-the-art AVSE models

02. The model achieves better results than traditional audio-only SE models

03. Multi-modal embeddings generalize well to AV regression tasks

Abstract

AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that useful audio-visual speech representations can be obtained from multi-modal self-supervised embeddings. Nevertheless, it is unclear whether such representations generalize to real-world multi-modal AV regression tasks, such as audio-visual speech enhancement (AVSE) and audio-visual speech separation (AVSS). In this study, we leveraged the pre-trained AV-HuBERT model followed by an SE module for AVSE and AVSS. Comparative experimental results demonstrate that our proposed model performs better than the state-of-the-art AVSE and traditional audio-only SE models. In summary, our results confirm the effectiveness of our proposed model for the AVSS task with proper fine-tuning strategies,…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing