TL;DR
USpeech is a cross-modal ultrasound synthesis framework for speech enhancement. It uses audio as a bridge between the visual and ultrasonic modalities to synthesize ultrasound data with minimal human effort, addressing data scarcity and heterogeneity.
Contribution
The paper presents a two-stage framework combining contrastive pre-training and ultrasound synthesis to improve ultrasound-based speech enhancement without extensive data collection.
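The summary does not state which contrastive objective the pre-training stage uses or which modality pairs it aligns. As a minimal sketch only, assuming a symmetric InfoNCE-style loss (a common choice for cross-modal contrastive pre-training, not confirmed by the source), the idea of pulling paired embeddings together and pushing unpaired ones apart can be illustrated as follows. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy of matching anchor i to positive i within the batch.

    Assumption: this is a generic InfoNCE loss, not the paper's stated objective.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Similarity of anchor i to every candidate in the other modality.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def symmetric_info_nce(x, y, temperature=0.1):
    """Average the loss over both matching directions (x to y and y to x)."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


# Toy embeddings (hypothetical): two samples from two modalities.
emb_a = [[1.0, 0.0], [0.0, 1.0]]
emb_b_aligned = [[0.9, 0.1], [0.1, 0.9]]
emb_b_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_info_nce(emb_a, emb_b_aligned)
loss_swapped = symmetric_info_nce(emb_a, emb_b_swapped)
```

In this toy setup the loss is lower when row i of one modality matches row i of the other than when the pairing is swapped, which is the property a contrastive pre-training stage relies on.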
Findings
Synthetic ultrasound data achieves performance comparable to physically collected data.
USpeech outperforms existing ultrasound-based speech enhancement methods.
The framework overcomes the data scarcity and heterogeneity challenges.
Abstract
Speech enhancement is crucial for ubiquitous human-computer interaction. Recently, ultrasound-based acoustic sensing has emerged as an attractive choice for speech enhancement because of its superior ubiquity and performance. However, due to inevitable interference from unexpected and unintended sources during audio-ultrasound data acquisition, existing solutions rely heavily on human effort for data collection and processing. This leads to significant data scarcity that limits the full potential of ultrasound-based speech enhancement. To address this, we propose USpeech, a cross-modal ultrasound synthesis framework for speech enhancement with minimal human effort. At its core is a two-stage framework that establishes the correspondence between visual and ultrasonic modalities by leveraging audio as a bridge. This approach overcomes challenges from the lack of paired video-ultrasound…
