Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement
Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

TL;DR
This paper introduces ultrasound tongue images into audio-visual speech enhancement, using knowledge distillation and a lip-tongue memory network to improve speech quality and robustness without requiring ultrasound during inference.
Contribution
The paper proposes two ways to bring tongue-related information into AV-SE: knowledge distillation from an audio-lip-tongue teacher model into an audio-lip student model, and a lip-tongue memory network. Both aim to improve enhancement performance and generalization.
Findings
Significant improvement in speech quality and intelligibility.
Strong generalization to unseen speakers and noises.
Enhanced recognition of palatal and velar consonants.
Abstract
Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech along with extra visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes the incorporation of ultrasound tongue images to improve the performance of lip-based AV-SE systems further. To address the challenge of acquiring ultrasound tongue images during inference, we first propose to employ knowledge distillation during training to investigate the feasibility of leveraging tongue-related information without directly inputting ultrasound tongue images. Specifically, we guide an audio-lip speech enhancement student model to learn from a pre-trained audio-lip-tongue speech enhancement teacher model, thus transferring tongue-related knowledge. To better model the alignment between the lip and tongue modalities, we further propose the…
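The teacher-student setup described in the abstract can be illustrated with a generic distillation objective. The excerpt does not state the loss the authors use, so the following is only a minimal sketch under assumptions: the student is trained on a weighted sum of an enhancement error against the clean target and a feature-matching error against the frozen teacher. The function names, the mean-squared-error terms, and the `alpha` weight are all hypothetical.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student_out: Sequence[float],
    clean_target: Sequence[float],
    student_feat: Sequence[float],
    teacher_feat: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Hypothetical student objective (an assumption, not the paper's loss).

    student_out / clean_target: enhanced vs. clean speech features.
    student_feat: intermediate features of the audio-lip student.
    teacher_feat: matching features of the pre-trained audio-lip-tongue
                  teacher, treated as fixed targets.
    alpha: weight of the enhancement term; (1 - alpha) weights distillation.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    enhancement = mse(student_out, clean_target)
    distill = mse(student_feat, teacher_feat)
    return alpha * enhancement + (1.0 - alpha) * distill
```

In a sketch like this, ultrasound is only needed to compute `teacher_feat` during training; at inference the student runs on audio and lip input alone, which matches the motivation given in the abstract.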
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Advanced Adaptive Filtering Techniques
Methods: Memory Network · Knowledge Distillation
