FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation
Weiting Tan, Andy T. Liu, Ming Tu, Xinghua Qu, Philipp Koehn, Lu Lu

TL;DR
FlowPortrait is a reinforcement-learning framework that post-trains an audio-driven portrait video generator with Group Relative Policy Optimization (GRPO). Its composite reward combines MLLM-based scores for lip-sync accuracy, expressiveness, and motion quality with perceptual and temporal consistency regularizers.
Contribution
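It introduces a human-aligned evaluation system based on Multimodal Large Language Models (MLLMs) and folds its scores, together with perceptual and temporal consistency regularizers, into a stable composite reward for GRPO post-training of a talking-head generator.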
Findings
Outperforms existing methods in automatic and human evaluations
Achieves better lip-sync accuracy and motion naturalness
Demonstrates the effectiveness of reinforcement learning in portrait animation
Abstract
Generating realistic talking-head videos remains challenging due to persistent issues such as imperfect lip synchronization, unnatural motion, and evaluation metrics that correlate poorly with human perception. We propose FlowPortrait, a reinforcement-learning framework for audio-driven portrait animation built on a multimodal backbone for autoregressive audio-to-video generation. FlowPortrait introduces a human-aligned evaluation system based on Multimodal Large Language Models (MLLMs) to assess lip-sync accuracy, expressiveness, and motion quality. These signals are combined with perceptual and temporal consistency regularizers to form a stable composite reward, which is used to post-train the generator via Group Relative Policy Optimization (GRPO). Extensive experiments, including both automatic evaluations and human preference studies, demonstrate that FlowPortrait consistently…
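The reward construction and GRPO step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the authors' implementation: the function names, the weights, the use of a plain average over the three MLLM scores, the subtraction of the regularizers as penalties, and the choice of population standard deviation. Only the general idea (a composite reward, then advantages normalized within a group of samples generated for the same input) comes from the text and from the standard GRPO formulation.

```python
from statistics import mean, pstdev


def composite_reward(mllm_scores, perceptual_penalty, temporal_penalty,
                     w_mllm=1.0, w_perc=0.1, w_temp=0.1):
    """Combine MLLM scores with two regularizers into one scalar.

    mllm_scores: dict of hypothetical per-aspect scores, e.g.
        {"lip_sync": 0.8, "expressiveness": 0.6, "motion_quality": 0.7}.
    The weights are illustrative placeholders, not values from the paper.
    """
    mllm_term = mean(list(mllm_scores.values()))
    return (w_mllm * mllm_term
            - w_perc * perceptual_penalty
            - w_temp * temporal_penalty)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    are compared against their siblings rather than against a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: three candidate videos generated for the same audio clip.
group = [
    composite_reward({"lip_sync": 0.9, "expressiveness": 0.7, "motion_quality": 0.8}, 0.2, 0.1),
    composite_reward({"lip_sync": 0.5, "expressiveness": 0.6, "motion_quality": 0.4}, 0.3, 0.4),
    composite_reward({"lip_sync": 0.7, "expressiveness": 0.7, "motion_quality": 0.7}, 0.2, 0.2),
]
advantages = group_relative_advantages(group)
```

In this sketch the first candidate receives a positive advantage and the second a negative one, which is the signal a policy-gradient update would use to shift the generator toward the better-scoring samples.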
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Speech and Audio Processing
