emg2speech: Synthesizing speech from electromyography using self-supervised speech models
Harshavardhana T. Gowda, Daniel C. Comstock, and Lee M. Miller

TL;DR
This paper introduces a neuromuscular speech interface that converts EMG signals from facial muscles into speech audio using self-supervised speech models, enabling end-to-end EMG-to-speech synthesis.
Contribution
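It shows that self-supervised speech (S3) representations are strongly linearly related to EMG power, which suggests that they encode articulatory information, and it uses them for direct EMG-to-speech conversion without explicit articulatory modeling or vocoder training.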
Findings
A simple linear mapping predicts EMG power from self-supervised speech (S3) representations with a correlation of r = 0.85.
EMG power vectors associated with distinct articulatory gestures form structured, separable clusters.
End-to-end EMG-to-speech synthesis is demonstrated with a participant with ALS.
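The first finding, a linear map from S3 representations to EMG power scored by correlation, can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the feature dimension, channel count, noise level, train/test split, and the helper names `fit_linear_map`, `predict`, and `mean_pearson_r` are all assumptions made for the example. The summary also does not say how the reported r is aggregated, so averaging across channels here is an assumption, and the resulting value is not expected to match the reported r = 0.85.

```python
import numpy as np

def fit_linear_map(features, targets):
    """Ordinary least-squares fit of targets ~ [features, 1] @ weights."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # add bias column
    weights, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return weights

def predict(features, weights):
    """Apply the fitted linear map (including the bias term)."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ weights

def mean_pearson_r(y_true, y_pred):
    """Pearson correlation per output channel, averaged across channels."""
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    r = (yt * yp).sum(axis=0) / (
        np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    )
    return float(r.mean())

# Synthetic stand-ins (assumed shapes): frames x S3 features, frames x EMG channels.
rng = np.random.default_rng(0)
n_frames, n_feat, n_chan = 2000, 32, 8
s3 = rng.normal(size=(n_frames, n_feat))
true_w = rng.normal(size=(n_feat, n_chan))
emg_power = s3 @ true_w + 0.5 * rng.normal(size=(n_frames, n_chan))

# Fit on the first 1500 frames, evaluate correlation on the held-out rest.
train, test = slice(0, 1500), slice(1500, None)
w = fit_linear_map(s3[train], emg_power[train])
r = mean_pearson_r(emg_power[test], predict(s3[test], w))
print(f"held-out mean Pearson r: {r:.3f}")
```

Evaluating on held-out frames rather than the training frames keeps the correlation from being inflated by overfitting, which matters more as the feature dimension grows relative to the number of frames.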
Abstract
We present a neuromuscular speech interface that translates electromyographic (EMG) signals recorded from orofacial muscles during speech articulation directly into audio. We find that self-supervised speech (S3) representations are strongly linearly related to the electrical power of muscle activity: a simple linear mapping predicts EMG power from S3 representations with a correlation of r = 0.85. In addition, EMG power vectors associated with distinct articulatory gestures form structured, separable clusters. Together, these observations suggest that S3 models implicitly encode articulatory mechanisms, as reflected in EMG activity. Leveraging this structure, we map EMG signals into the S3 representation space and synthesize speech, enabling end-to-end EMG-to-speech generation without explicit articulatory modeling or vocoder training. We demonstrate this system with a participant with…
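The abstract refers to the electrical power of muscle activity without saying, in this excerpt, how that power is computed. One common definition is the mean squared amplitude per channel over sliding windows; the sketch below assumes that definition. The function name `windowed_power`, the window and hop lengths, and the array layout are hypothetical choices for illustration.

```python
import numpy as np

def windowed_power(emg, win, hop):
    """Mean squared amplitude per channel over sliding windows.

    emg: array of shape (n_samples, n_channels)
    returns: array of shape (n_windows, n_channels)
    """
    n_windows = 1 + (emg.shape[0] - win) // hop
    return np.stack([
        np.mean(emg[i * hop : i * hop + win] ** 2, axis=0)
        for i in range(n_windows)
    ])

# Toy input: a constant-amplitude signal on 4 channels (assumed layout).
emg = np.full((100, 4), 2.0)
power = windowed_power(emg, win=20, hop=10)
print(power.shape)
```

Each row of the result is one power vector across channels, which is the kind of object the clustering finding describes.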
