Speaking Without Sound: Multi-speaker Silent Speech Voicing with Facial Inputs Only

Jaejun Lee, Yoori Oh, and Kyogu Lee

arXiv:2602.01879 · cs.SD · February 3, 2026

TL;DR

This paper presents a method for generating multi-speaker speech without any audible input: silent EMG signals supply the linguistic content, facial images determine the target speaker's vocal identity, and a pitch-disentangled content embedding improves the extraction of linguistic content from EMG.

Contribution

The paper introduces a framework that combines silent EMG signals with facial images to synthesize speech without audible input, and proposes a pitch-disentangled content embedding that improves linguistic content extraction from EMG signals.

Findings

01 Successful multi-speaker speech generation without audible inputs

02 Effective pitch-disentanglement improves linguistic content extraction

03 Framework demonstrates potential for silent communication applications

Abstract

In this paper, we introduce a novel framework for generating multi-speaker speech without relying on any audible inputs. Our approach leverages silent electromyography (EMG) signals to capture linguistic content, while facial images are used to match with the vocal identity of the target speaker. Notably, we present a pitch-disentangled content embedding that enhances the extraction of linguistic content from EMG signals. Extensive analysis demonstrates that our method can generate multi-speaker speech without any audible inputs and confirms the effectiveness of the proposed pitch-disentanglement approach.

Taxonomy

Topics: Speech and Audio Processing · Emotion and Mood Recognition · Voice and Speech Disorders