Fusion Embedding for Pose-Guided Person Image Synthesis with Diffusion Model
Donghwna Lee, Kyungha Min, Kirok Kim, Seyoung Jeong, Jiwoo Jeong, Wooju Kim

TL;DR
This paper introduces FPDM, a two-stage fusion embedding approach using diffusion models for pose-guided person image synthesis, achieving state-of-the-art results on benchmark datasets.
Contribution
Proposes a novel two-stage fusion embedding method for PGPIS leveraging pre-trained CLIP models, simplifying the model structure and improving synthesis quality.
Findings
Achieves state-of-the-art (SOTA) performance on the DeepFashion and RWTH-PHOENIX datasets.
Even a simplified model with only the second stage performs competitively.
Demonstrates the effectiveness of fusion embedding in preserving appearance and pose accuracy.
Abstract
Pose-Guided Person Image Synthesis (PGPIS) aims to synthesize high-quality person images corresponding to target poses while preserving the appearance of the source image. Recently, PGPIS methods that use diffusion models have achieved competitive performance. Most approaches involve extracting representations of the target pose and source image and learning their relationships in the generative model's training process. This approach makes it difficult to learn the semantic relationships between the input and target images and complicates the model structure needed to enhance generation results. To address these issues, we propose Fusion embedding for PGPIS using a Diffusion Model (FPDM). Inspired by the successful application of pre-trained CLIP models in text-to-image diffusion models, our method consists of two stages. The first stage involves training the fusion embedding of the…
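The abstract is cut off before it states the first-stage training objective, so the following is only an illustrative sketch of a CLIP-style symmetric contrastive loss, not the paper's confirmed method. It assumes (hypothetically) that each fusion embedding is pulled toward the embedding of its matching target image within a batch; the function names, the temperature value, and the toy data are all assumptions introduced for illustration.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax_at(row, index):
    """Numerically stable log-softmax of `row`, evaluated at `index`."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return row[index] - log_sum


def symmetric_contrastive_loss(fusion_embs, target_embs, temperature=0.07):
    """CLIP-style loss: pair i of fusion/target embeddings is the positive,
    every other pairing in the batch is a negative.
    `temperature=0.07` is an assumed value, not taken from the paper."""
    f = [normalize(v) for v in fusion_embs]
    t = [normalize(v) for v in target_embs]
    n = len(f)
    # Cosine similarity matrix, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(f[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Fusion -> target direction (softmax over each row).
    f2t = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Target -> fusion direction (softmax over each column).
    t2f = -sum(
        log_softmax_at([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (f2t + t2f)


# Toy data: 8 hypothetical target-image embeddings of dimension 32.
random.seed(0)
targets = [[random.gauss(0, 1) for _ in range(32)] for _ in range(8)]
# Fusion embeddings that nearly match their targets ...
aligned = [[x + 0.01 * random.gauss(0, 1) for x in v] for v in targets]
# ... and the same embeddings paired with the wrong targets.
mismatched = aligned[::-1]

loss_aligned = symmetric_contrastive_loss(aligned, targets)
loss_mismatched = symmetric_contrastive_loss(mismatched, targets)
print(loss_aligned < loss_mismatched)
```

Under these assumptions, correctly paired embeddings yield a much lower loss than mismatched ones, which is the property a contrastive alignment stage would rely on.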
Taxonomy
Topics: Video Surveillance and Tracking Methods · Face recognition and analysis · Advanced Image and Video Retrieval Techniques
Methods: ALIGN · Diffusion · Contrastive Language-Image Pre-training
