Real-Time Person Image Synthesis Using a Flow Matching Model
Jiwoo Jeong, Kirok Kim, Wooju Kim, Nam-Joon Kim

TL;DR
This paper introduces a flow matching-based generative model that significantly improves the speed of person image synthesis conditioned on pose, enabling near-real-time performance while maintaining high image quality.
Contribution
The proposed flow matching model offers a faster, more stable, and more efficient alternative to diffusion methods for pose-guided person image synthesis, making real-time applications feasible.
Findings
Achieves near-real-time sampling speeds on the DeepFashion dataset.
Maintains image quality comparable to state-of-the-art models.
Trades a slight decrease in accuracy for a more than twofold increase in speed.
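To make the speed findings above more concrete, the following is a minimal, generic sketch of how sampling from a flow matching model typically works: a sample is drawn from noise and pushed along a velocity field with a fixed-step ODE solver, so the cost is roughly one network evaluation per step. This is not the authors' implementation, which this summary does not describe. The names `euler_sample`, `toy_velocity`, and `TARGET` are hypothetical, and the velocity field is an analytic stand-in for a trained, pose-conditioned network.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h
        v = velocity(x, t)  # in a real model: one network evaluation per step
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


# Assumption for illustration only: a single fixed target instead of an image
# distribution, so the exact velocity field is known in closed form.
TARGET = [0.5, -1.0, 2.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    """Velocity of the straight path x_t = (1 - t) * x0 + t * TARGET."""
    return [(c - xi) / (1.0 - t) for c, xi in zip(TARGET, x)]


noise = [0.3, 1.7, -0.4]  # plays the role of the Gaussian starting sample
few = euler_sample(toy_velocity, noise, num_steps=2)    # 2 evaluations
many = euler_sample(toy_velocity, noise, num_steps=50)  # 50 evaluations
```

In this toy setting both `few` and `many` reach `TARGET` up to floating-point rounding, because Euler integration is exact along straight trajectories. A trained model's velocity field is only approximately straight, so in practice the step count trades image quality against latency.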
Abstract
Pose-Guided Person Image Synthesis (PGPIS) generates realistic person images conditioned on a target pose and a source image. This task plays a key role in various real-world applications, such as sign language video generation, AR/VR, gaming, and live streaming. In these scenarios, real-time PGPIS is critical for providing immediate visual feedback and maintaining user immersion. However, achieving real-time performance remains a significant challenge due to the complexity of synthesizing high-fidelity images from diverse and dynamic human poses. Recent diffusion-based methods have shown impressive image quality in PGPIS, but their slow sampling speeds hinder deployment in time-sensitive applications. This latency is particularly problematic in tasks like generating sign language videos during live broadcasts, where rapid image updates are required. Therefore, developing a fast and…
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
