Digital Voicing of Silent Speech
David Gaddy, Dan Klein

TL;DR
This paper converts silently mouthed speech into audible speech from EMG signals, training directly on EMG recorded during silent articulation and substantially improving intelligibility over a baseline trained only on vocalized data.
Contribution
First to train speech synthesis models on EMG collected during silently articulated speech, by transferring audio targets from vocalized to silent signals; also releases a new dataset of silent and vocalized facial EMG measurements.
Findings
Transcription word error rate fell from 64% to 4% in one data condition and from 88% to 68% in another.
Both improvements are measured against a baseline that trains only on vocalized data.
Shared a new dataset of silent and vocalized EMG measurements.
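For readers unfamiliar with the metric behind these numbers, word error rate is conventionally the word-level edit distance (substitutions, insertions, deletions) between a reference transcript and a transcription of the generated audio, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name `word_error_rate` is illustrative, and the paper's own text normalization and transcription procedure are not described here, so this is not a reproduction of its evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.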
Abstract
In this paper, we consider the task of digitally voicing silent speech, where silently mouthed words are converted to audible speech based on electromyography (EMG) sensor measurements that capture muscle impulses. While prior work has focused on training speech synthesis models from EMG collected during vocalized speech, we are the first to train from EMG collected during silently articulated speech. We introduce a method of training on silent EMG by transferring audio targets from vocalized to silent signals. Our method greatly improves intelligibility of audio generated from silent EMG compared to a baseline that only trains with vocalized data, decreasing transcription word error rate from 64% to 4% in one data condition and 88% to 68% in another. To spur further development on this task, we share our new dataset of silent and vocalized facial EMG measurements.
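The abstract does not spell out how audio targets are moved from a vocalized recording to a silent one. One common way to transfer frame-level targets between two differently timed recordings of the same utterance is dynamic time warping (DTW), and the sketch below illustrates that idea under stated assumptions: the feature representation, the squared-Euclidean distance, and the names `dtw_path` and `transfer_targets` are all hypothetical choices made for illustration, not details taken from this summary.

```python
from math import inf


def dtw_path(a, b, dist):
    """Return the minimum-cost monotonic alignment between sequences a and b
    as a list of (index_in_a, index_in_b) pairs."""
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Walk back from the end of both sequences along the cheapest predecessors.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def squared_euclidean(x, y):
    return sum((p - q) ** 2 for p, q in zip(x, y))


def transfer_targets(silent_feats, vocal_feats, vocal_audio_targets):
    """Give every silent-EMG frame the audio target of its aligned vocalized frame.

    Simplification: if a silent frame aligns to several vocalized frames,
    the last one on the path is kept.
    """
    targets = [None] * len(silent_feats)
    for i, j in dtw_path(silent_feats, vocal_feats, squared_euclidean):
        targets[i] = vocal_audio_targets[j]
    return targets
```

With targets assigned this way, a synthesis model could in principle be trained on silent EMG frames even though no audio was produced during the silent recording.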
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Speech and dialogue systems
