Audio-Visual Speech Enhancement and Separation by Utilizing Multi-Modal Self-Supervised Embeddings
I-Chun Chern, Kuo-Hsuan Hung, Yi-Ting Chen, Tassadaq Hussain, Mandar Gogate, Amir Hussain, Yu Tsao, Jen-Cheng Hou

TL;DR
This paper demonstrates that multi-modal self-supervised embeddings from AV-HuBERT can be effectively used for real-world audio-visual speech enhancement and separation tasks, outperforming existing models.
Contribution
It introduces a novel approach leveraging AV-HuBERT embeddings with an SE module for AVSE and AVSS, showing improved performance over state-of-the-art methods.
Findings
Proposed model outperforms state-of-the-art AVSE models
Model achieves better results than traditional audio-only SE models
Multi-modal embeddings generalize well to AV regression tasks
Abstract
AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that useful audio-visual speech representations can be obtained via utilizing multi-modal self-supervised embeddings. Nevertheless, it is unclear if such representations can be generalized to solve real-world multi-modal AV regression tasks, such as audio-visual speech enhancement (AVSE) and audio-visual speech separation (AVSS). In this study, we leveraged the pre-trained AV-HuBERT model followed by an SE module for AVSE and AVSS. Comparative experimental results demonstrate that our proposed model performs better than the state-of-the-art AVSE and traditional audio-only SE models. In summary, our results confirm the effectiveness of our proposed model for the AVSS task with proper fine-tuning strategies,…
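The pipeline described in the abstract (a pre-trained audio-visual encoder followed by an SE module) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `stand_in_av_encoder` is a random-weight placeholder and not AV-HuBERT, `se_module` is a hypothetical single-layer head, the masking-based output is a common SE formulation rather than a detail confirmed by the abstract, and audio and video features are assumed to be already aligned to the same frame rate.

```python
import numpy as np

rng = np.random.default_rng(0)


def stand_in_av_encoder(noisy_mag, lip_feats, w_a, w_v):
    """Hypothetical placeholder for a frozen pre-trained AV encoder.

    This is NOT AV-HuBERT: it only projects each modality to a shared
    embedding size and fuses them by addition.
    """
    return np.tanh(noisy_mag @ w_a + lip_feats @ w_v)


def se_module(embeddings, w_out):
    """Hypothetical SE head: maps embeddings to a per-bin sigmoid mask."""
    return 1.0 / (1.0 + np.exp(-(embeddings @ w_out)))


def enhance(noisy_mag, lip_feats, params):
    """Encoder followed by SE head; the mask scales the noisy magnitudes."""
    emb = stand_in_av_encoder(noisy_mag, lip_feats, params["w_a"], params["w_v"])
    mask = se_module(emb, params["w_out"])
    return mask * noisy_mag


# Illustrative sizes (assumed): frames, frequency bins, visual dim, embedding dim.
T, F, V, D = 50, 257, 64, 128
params = {
    "w_a": rng.standard_normal((F, D)) * 0.01,
    "w_v": rng.standard_normal((V, D)) * 0.01,
    "w_out": rng.standard_normal((D, F)) * 0.01,
}

noisy_mag = np.abs(rng.standard_normal((T, F)))  # stand-in noisy magnitude spectrogram
lip_feats = rng.standard_normal((T, V))          # stand-in per-frame lip-region features

enhanced = enhance(noisy_mag, lip_feats, params)
print(enhanced.shape)
```

In a real system the placeholder encoder would be replaced by the pre-trained model, and the head would be trained (optionally with encoder fine-tuning) against clean-speech targets.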
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
