FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

Taekyung Ki; Dongchan Min; Gyeongsu Chae

arXiv:2412.01064 · cs.CV · September 22, 2025

TL;DR

FLOAT is a flow matching generative model that uses a learned motion latent space and a transformer-based vector field predictor to generate temporally consistent, expressive talking portrait videos from audio with fast sampling. The authors report that it outperforms existing methods in visual quality and motion fidelity.

Contribution

The paper proposes a flow matching generative approach to audio-driven talking portrait video generation that pairs a learned orthogonal motion latent space with a transformer-based vector field predictor.
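
As a rough illustration of why an orthogonal latent space makes motion easy to edit, the sketch below builds an orthonormal basis with a QR decomposition and shows that changing one coefficient leaves the others untouched. This page does not describe how FLOAT constructs its latent space, so the basis, the dimensions, and the names `compose` and `decompose` are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_directions = 16, 4  # hypothetical sizes, chosen only for the demo

# QR decomposition of a random matrix yields orthonormal columns,
# standing in here for a set of learned motion directions.
basis, _ = np.linalg.qr(rng.standard_normal((dim, num_directions)))

def compose(coeffs):
    """Build a motion latent as a weighted sum of the basis directions."""
    return basis @ coeffs

def decompose(latent):
    """Recover the coefficients by projecting onto each direction."""
    return basis.T @ latent

coeffs = np.array([0.5, -1.0, 2.0, 0.0])
latent = compose(coeffs)

# Edit a single direction: push the latent along basis column 2.
edited = latent + 0.7 * basis[:, 2]
edited_coeffs = decompose(edited)
```

Because the columns are orthonormal, `edited_coeffs` differs from `coeffs` only in index 2; with a non-orthogonal basis the same edit would leak into the other coefficients.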

Findings

01. Outperforms state-of-the-art methods in visual quality and motion fidelity

02. Achieves faster sampling through its flow matching approach

03. Supports speech-driven emotion enhancement

Abstract

With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on a flow matching generative model. Instead of a pixel-based latent space, we take advantage of a learned orthogonal motion latent space, enabling efficient generation and editing of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with an effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art…
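
For readers unfamiliar with flow matching, the following is a minimal sketch of the generic technique, assuming a straight-line path between noise and data and a fixed-step Euler solver. It is not FLOAT's implementation: the function names, the toy 3-dimensional target, and the closed-form `oracle` field standing in for the transformer-based predictor are all hypothetical.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Velocity of the straight-line path; it is constant in t."""
    return x1 - x0

def flow_matching_loss(predict, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    return float(np.mean((predict(x_t, t) - target_velocity(x0, x1)) ** 2))

def euler_sample(predict, x0, num_steps):
    """Integrate dx/dt = predict(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * predict(x, k * dt)
    return x

# Toy setting (assumption): the "data" is a single fixed point, so the ideal
# vector field is known in closed form and can replace a trained network.
rng = np.random.default_rng(0)
x1 = np.array([1.0, -2.0, 0.5])
x0 = rng.standard_normal(3)

def oracle(x, t):
    """Ideal velocity field for a single target point x1."""
    return (x1 - x) / (1.0 - t)

loss = flow_matching_loss(oracle, x0, x1, t=0.3)
sample = euler_sample(oracle, x0, num_steps=10)
```

In this toy setting the path is a straight line, so ten Euler steps already carry the noise sample to the target. A learned predictor would only approximate the ideal field, but the example gives some intuition for why flow matching can sample in few steps.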

Code & Models

Models

🤗 nzgnzg73/NZG_FLOAT_Optimized (model)

Taxonomy

Topics: Music and Audio Processing · Human Motion and Animation · Speech Recognition and Synthesis