Speaking Without Sound: Multi-speaker Silent Speech Voicing with Facial Inputs Only
Jaejun Lee, Yoori Oh, and Kyogu Lee

TL;DR
This paper presents a method for generating multi-speaker speech from silent EMG signals and facial images alone, with no audible input. A pitch-disentangled content embedding improves the extraction of linguistic content from the EMG signals.
Contribution
The paper introduces a framework that uses silent EMG signals to capture linguistic content and facial images to match the target speaker's vocal identity, and it proposes a pitch-disentangled content embedding that improves linguistic content extraction from EMG.
Findings
Successful multi-speaker speech generation without audible inputs
Effective pitch-disentanglement improves linguistic content extraction
Framework demonstrates potential for silent communication applications
Abstract
In this paper, we introduce a novel framework for generating multi-speaker speech without relying on any audible inputs. Our approach leverages silent electromyography (EMG) signals to capture linguistic content, while facial images are used to match the vocal identity of the target speaker. Notably, we present a pitch-disentangled content embedding that enhances the extraction of linguistic content from EMG signals. Extensive analysis demonstrates that our method can generate multi-speaker speech without any audible inputs and confirms the effectiveness of the proposed pitch-disentanglement approach.
Taxonomy
Topics: Speech and Audio Processing · Emotion and Mood Recognition · Voice and Speech Disorders
