PMMD: A pose-guided multi-view multi-modal diffusion for person generation
Ziyu Shang, Haoran Liu, Rongchao Zhang, Zhiqian Wei, Tongtong Feng

TL;DR
PMMD is a diffusion-based framework that synthesizes photorealistic person images with controllable pose and appearance by conditioning on multi-view references, pose maps, and text prompts. It targets occlusion, garment style drift, and pose misalignment.
Contribution
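The paper introduces PMMD, a pose-guided multi-view multimodal diffusion model with three components: a multimodal encoder that jointly models visual views, pose features, and text descriptions; a ResCVA module that enhances local detail while preserving global structure; and a cross-modal fusion module that integrates image semantics with text throughout denoising.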
Findings
Outperforms baselines in consistency and detail preservation
Effectively handles occlusions and pose misalignments
Demonstrates superior controllability in person image generation
Abstract
Generating consistent human images with controllable pose and appearance is essential for applications in virtual try-on, image editing, and digital human creation. Current methods often suffer from occlusions, garment style drift, and pose misalignment. We propose Pose-guided Multi-view Multimodal Diffusion (PMMD), a diffusion framework that synthesizes photorealistic person images conditioned on multi-view references, pose maps, and text prompts. A multimodal encoder jointly models visual views, pose features, and semantic descriptions, which reduces cross-modal discrepancy and improves identity fidelity. We further design a ResCVA module to enhance local detail while preserving global structure, and a cross-modal fusion module that integrates image semantics with text throughout the denoising pipeline. Experiments on the DeepFashion MultiModal dataset show that PMMD outperforms…
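As a rough illustration of what conditioning a denoiser on several modalities can look like, the sketch below concatenates stand-in tokens for reference views, pose, and text into one context sequence and lets latent tokens attend to it through plain single-head cross-attention with a residual connection. This is a generic technique, not the authors' architecture: the abstract does not specify the internals of the multimodal encoder, the ResCVA module, or the fusion module, so every name, shape, and random feature here is a hypothetical placeholder, and no learned projections are modeled.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head cross-attention without learned projections (illustrative only).

    queries: latent tokens, each a list of d floats
    context: conditioning tokens, each a list of d floats
    Returns (outputs, weights); outputs include a residual connection.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Scaled dot-product score between this latent token and each context token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        w = softmax(scores)
        # Weighted sum of context tokens, one feature dimension at a time.
        attended = [sum(w[j] * context[j][i] for j in range(len(context)))
                    for i in range(d)]
        # Residual connection: keep the latent and add the attended context.
        outputs.append([qi + ai for qi, ai in zip(q, attended)])
        weights.append(w)
    return outputs, weights


def random_tokens(n, d, rng):
    """Hypothetical stand-in for encoder outputs: n random tokens of width d."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
d = 8
view_tokens = random_tokens(6, d, rng)  # placeholder for multi-view reference features
pose_tokens = random_tokens(4, d, rng)  # placeholder for pose-map features
text_tokens = random_tokens(5, d, rng)  # placeholder for text-prompt features

# One shared context sequence: views, then pose, then text.
context = view_tokens + pose_tokens + text_tokens
latents = random_tokens(3, d, rng)      # placeholder for noisy image latents

fused, attn = cross_attention(latents, context)
print(len(fused), len(fused[0]), len(attn[0]))
```

In a real diffusion model this step would sit inside each denoising block, with learned query, key, and value projections and multiple heads; the sketch only shows how a single attention call can mix information from all three conditioning sources.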
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Multimodal Machine Learning Applications
