V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow

Jeongsoo Choi; Ji-Hoon Kim; Jinyu Li; Joon Son Chung; Shujie Liu

arXiv:2411.19486·cs.CV·June 2, 2025

TL;DR

V2SFlow decomposes speech into content, pitch, and speaker subspaces, predicts each from silent talking-face video, and uses a rectified flow decoder to generate natural speech, outperforming existing methods, especially in unconstrained scenarios.

Contribution

The paper presents a new Video-to-Speech (V2S) framework that combines speech decomposition with rectified flow modeling, improving robustness and naturalness in real-world conditions.

Findings

01. Outperforms state-of-the-art V2S methods

02. Generates speech with higher naturalness than ground truth

03. Effective on unconstrained, real-world datasets

Abstract

In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos. While recent V2S systems have shown promising results on constrained datasets with limited speakers and vocabularies, their performance often degrades on real-world, unconstrained datasets due to the inherent variability and complexity of speech signals. To address these challenges, we decompose the speech signal into manageable subspaces (content, pitch, and speaker information), each representing distinct speech attributes, and predict them directly from the visual input. To generate coherent and realistic speech from these predicted attributes, we employ a rectified flow matching decoder built on a Transformer architecture, which models efficient probabilistic pathways from random noise to the target speech…
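The rectified flow idea mentioned in the abstract can be illustrated with a minimal, generic sketch. This is not the authors' implementation: it omits the Transformer decoder and the conditioning on predicted content, pitch, and speaker attributes, and the names `interpolate`, `euler_sample`, and `velocity_fn` are hypothetical helpers introduced here for illustration. It assumes the standard rectified flow formulation, in which a model is trained to regress the constant velocity `x1 - x0` along the straight path between a noise sample `x0` and a target `x1`, and sampling integrates that velocity from t = 0 to t = 1.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Tuple[Vector, Vector]:
    """Build one rectified-flow training example.

    Returns the point x_t on the straight line from noise x0 to target x1,
    and the regression target for the velocity model, which is x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


def euler_sample(
    velocity_fn: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy usage: an "oracle" velocity field stands in for a trained model
# (assumption for illustration only; a real decoder would be a neural network).
noise = [0.0, 1.0]
speech_frame = [2.0, -1.0]
_, oracle_velocity = interpolate(noise, speech_frame, 0.5)
generated = euler_sample(lambda x, t: oracle_velocity, noise, num_steps=4)
```

Because the oracle velocity is constant along the straight path, even a handful of Euler steps carries `noise` to `speech_frame` up to floating-point error, which is the intuition behind why straight paths allow sampling in few steps.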

Code & Models

Repositories

kaistmm/v2sflow
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Speech and dialogue systems

Methods: Attention Is All You Need · Residual Connection · Softmax · Adam · Label Smoothing · Dropout · Dense Connections · Linear Layer · Layer Normalization · Byte Pair Encoding