FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation

Weiting Tan; Andy T. Liu; Ming Tu; Xinghua Qu; Philipp Koehn; Lu Lu

arXiv:2603.00159 · cs.CV · March 3, 2026

Open Access

TL;DR

FlowPortrait is a reinforcement-learning framework for audio-driven portrait video generation. It targets lip-sync accuracy, motion naturalness, and perceptual quality by scoring generated videos with Multimodal Large Language Models (MLLMs) and post-training the generator on a regularized composite reward via Group Relative Policy Optimization (GRPO).

Contribution

The paper introduces an MLLM-based, human-aligned evaluation system for talking-head video and combines its scores with perceptual and temporal consistency regularizers into a stable composite reward used for GRPO post-training.

Findings

01. Outperforms existing methods in automatic and human evaluations
02. Achieves better lip-sync accuracy and motion naturalness
03. Demonstrates the effectiveness of reinforcement learning in portrait animation

Abstract

Generating realistic talking-head videos remains challenging due to persistent issues such as imperfect lip synchronization, unnatural motion, and evaluation metrics that correlate poorly with human perception. We propose FlowPortrait, a reinforcement-learning framework for audio-driven portrait animation built on a multimodal backbone for autoregressive audio-to-video generation. FlowPortrait introduces a human-aligned evaluation system based on Multimodal Large Language Models (MLLMs) to assess lip-sync accuracy, expressiveness, and motion quality. These signals are combined with perceptual and temporal consistency regularizers to form a stable composite reward, which is used to post-train the generator via Group Relative Policy Optimization (GRPO). Extensive experiments, including both automatic evaluations and human preference studies, demonstrate that FlowPortrait consistently…
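The reward construction described in the abstract can be illustrated with a minimal sketch. It assumes each sampled video in a group receives three MLLM scores (lip-sync, expressiveness, motion quality) plus two penalty terms, and that the advantage is the group-normalized reward as in the standard GRPO formulation. The function names, the weights, and the plain weighted-sum form are hypothetical choices for illustration; the abstract does not specify how FlowPortrait combines these terms.

```python
import numpy as np


def composite_reward(mllm_scores, perceptual_penalty, temporal_penalty,
                     w_mllm=1.0, w_perc=0.1, w_temp=0.1):
    """Combine per-sample MLLM scores with two regularizer penalties.

    mllm_scores: shape (G, 3), one row per sampled video in the group.
    The weights and the weighted-sum form are assumptions, not the paper's values.
    """
    mllm = np.mean(np.asarray(mllm_scores, dtype=float), axis=-1)
    perc = np.asarray(perceptual_penalty, dtype=float)
    temp = np.asarray(temporal_penalty, dtype=float)
    return w_mllm * mllm - w_perc * perc - w_temp * temp


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO advantage: center and scale rewards within one group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Hypothetical scores for a group of 4 videos sampled from the same audio prompt.
mllm_scores = [[0.9, 0.8, 0.7],
               [0.5, 0.6, 0.4],
               [0.7, 0.7, 0.9],
               [0.2, 0.3, 0.1]]
perceptual_penalty = [0.1, 0.2, 0.1, 0.5]
temporal_penalty = [0.05, 0.1, 0.3, 0.4]

rewards = composite_reward(mllm_scores, perceptual_penalty, temporal_penalty)
advantages = group_relative_advantages(rewards)
print("rewards:", rewards)
print("advantages:", advantages)
```

Because advantages are computed relative to the other samples in the same group, no separate value network is needed. Samples scoring above the group mean get a positive advantage and are reinforced, and the rest are discouraged.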

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Speech and Audio Processing