MuteSwap: Visual-informed Silent Video Identity Conversion
Yifan Liu, Yu Fang, Zhouhan Lin

TL;DR
MuteSwap is a visual-only framework for voice conversion on silent video: it generates intelligible speech and converts speaker identity using only visual cues, and it outperforms audio-dependent methods, particularly in noisy environments.
Contribution
This work introduces MuteSwap, the first framework for silent face-based voice conversion using contrastive learning and mutual information minimization to achieve accurate speech synthesis and identity transfer.
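To make the contrastive-alignment idea concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over paired face and voice identity embeddings. It is a generic illustration, not the loss used in the paper: the function name `info_nce`, the temperature value, and the embedding shapes are all assumptions chosen for the example.

```python
import numpy as np

def info_nce(face_emb, voice_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    face_emb, voice_emb: arrays of shape (batch, dim), where row i of each
    array is assumed to come from the same speaker (hypothetical setup).
    """
    # L2-normalize so the dot product is cosine similarity.
    f = face_emb / np.linalg.norm(face_emb, axis=1, keepdims=True)
    v = voice_emb / np.linalg.norm(voice_emb, axis=1, keepdims=True)
    logits = f @ v.T / temperature  # (batch, batch) similarity matrix

    def diag_cross_entropy(lg):
        # Numerically stable log-softmax over each row; the matching
        # pair for row i sits on the diagonal.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the face-to-voice and voice-to-face directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))
```

The loss is low when each face embedding is most similar to its own speaker's voice embedding and high when pairs are mismatched, which is the property that lets a face image stand in for a voice reference at inference time.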
Findings
MuteSwap outperforms audio-dependent methods in noisy conditions.
It achieves high speech intelligibility from silent videos.
The approach effectively separates visual features for identity and speech content.
Abstract
Conventional voice conversion modifies voice characteristics from a source speaker to a target speaker, relying on audio input from both sides. However, this process becomes infeasible when clean audio is unavailable, such as in silent videos or noisy environments. In this work, we focus on the task of Silent Face-based Voice Conversion (SFVC), which performs voice conversion entirely from visual inputs: given images of a target speaker and a silent video of a source speaker containing lip motion, SFVC generates speech that matches the identity of the target speaker while preserving the speech content of the source silent video. As this task requires generating intelligible speech and converting identity using only visual cues, it is particularly challenging. To address this, we introduce MuteSwap, a novel framework that employs contrastive learning to align cross-modality identities and…
