FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
Taekyung Ki, Dongchan Min, Gyeongsu Chae

TL;DR
FLOAT is a flow matching generative model that uses a learned motion latent space and a transformer-based vector field predictor to generate temporally consistent, expressive talking portrait videos efficiently. The authors report that it outperforms existing methods in visual quality and motion fidelity.
Contribution
The paper proposes a flow matching approach to audio-driven talking portrait generation that operates in a learned orthogonal motion latent space rather than a pixel-based one, paired with a transformer-based vector field predictor that uses frame-wise conditioning.
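To illustrate why an orthogonal motion latent space is convenient for editing, here is a minimal sketch. It assumes a motion latent can be written as a weighted sum of orthonormal direction vectors; the basis here is random rather than learned, and the names `basis`, `compose_motion`, `decompose_motion`, and the dimensions are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, num_directions = 16, 4  # hypothetical sizes, for illustration only

# Stand-in for a learned basis: QR of a random matrix gives orthonormal columns.
basis, _ = np.linalg.qr(rng.standard_normal((latent_dim, num_directions)))


def compose_motion(coeffs):
    """Map per-direction coefficients to a motion latent vector."""
    return basis @ coeffs


def decompose_motion(latent):
    """Recover coefficients by projection; exact because the basis is orthonormal."""
    return basis.T @ latent


coeffs = np.array([0.5, -1.0, 0.0, 2.0])
latent = compose_motion(coeffs)
recovered = decompose_motion(latent)

# Editing one direction leaves the other coefficients untouched.
scale = np.array([1.0, 1.0, 1.0, 0.5])
edited = decompose_motion(compose_motion(recovered * scale))
```

Because the directions are orthonormal, each coefficient can be read off or rescaled independently, which is the property that makes this kind of latent space easy to edit.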
Findings
Outperforms state-of-the-art methods in visual quality and motion fidelity
Achieves faster sampling than iterative diffusion-based methods, thanks to the flow matching approach
Supports speech-driven emotion enhancement
Abstract
With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on a flow matching generative model. Instead of a pixel-based latent space, we take advantage of a learned orthogonal motion latent space, enabling efficient generation and editing of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with an effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art…
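The flow matching idea mentioned in the abstract can be sketched generically as follows. This is not the authors' implementation: it assumes the common linear interpolation path between noise and data, uses a closed-form `oracle_velocity` for a single-point target in place of the audio-conditioned transformer predictor, and all function names are hypothetical.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Velocity of the straight path; the regression target in training."""
    return x1 - x0


def cfm_loss(predict_velocity, x0, x1, t):
    """Conditional flow matching loss: mean squared velocity error at x_t."""
    xt = interpolate(x0, x1, t)
    return float(np.mean((predict_velocity(xt, t) - target_velocity(x0, x1)) ** 2))


def euler_sample(predict_velocity, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * predict_velocity(x, k * dt)
    return x


# Toy setup: the "data" is a single point, so the ideal field is known exactly.
target = np.array([1.0, -2.0, 0.5])


def oracle_velocity(x, t):
    # Stand-in for a trained predictor (assumption, valid for t < 1).
    return (target - x) / (1.0 - t)


rng = np.random.default_rng(0)
noise = rng.standard_normal(3)
sample = euler_sample(oracle_velocity, noise, num_steps=10)
loss = cfm_loss(oracle_velocity, noise, target, t=0.3)
```

Sampling is a short, deterministic ODE integration rather than a long stochastic denoising chain, which is the usual reason flow matching is associated with faster sampling. In a real model the velocity predictor would be a learned network and the step count a tunable trade-off between speed and accuracy.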
Taxonomy
Topics: Music and Audio Processing · Human Motion and Animation · Speech Recognition and Synthesis
