PMMD: A pose-guided multi-view multi-modal diffusion for person generation

Ziyu Shang; Haoran Liu; Rongchao Zhang; Zhiqian Wei; Tongtong Feng

arXiv:2512.15069·cs.CV·December 18, 2025

Open Access

TL;DR

PMMD is a diffusion-based framework that synthesizes photorealistic person images with controllable pose and appearance by conditioning on multi-view references, pose maps, and text prompts. It targets the occlusion, garment style drift, and pose misalignment problems seen in current methods.

Contribution

The paper introduces PMMD, a pose-guided multi-view multimodal diffusion model that combines a multimodal encoder, a ResCVA module, and a cross-modal fusion module to improve person image synthesis quality.

Findings

01 Outperforms baselines in consistency and detail preservation

02 Effectively handles occlusions and pose misalignments

03 Demonstrates superior controllability in person image generation

Abstract

Generating consistent human images with controllable pose and appearance is essential for applications in virtual try-on, image editing, and digital human creation. Current methods often suffer from occlusions, garment style drift, and pose misalignment. We propose Pose-guided Multi-view Multimodal Diffusion (PMMD), a diffusion framework that synthesizes photorealistic person images conditioned on multi-view references, pose maps, and text prompts. A multimodal encoder jointly models visual views, pose features, and semantic descriptions, which reduces cross-modal discrepancy and improves identity fidelity. We further design a ResCVA module to enhance local detail while preserving global structure, and a cross-modal fusion module that integrates image semantics with text throughout the denoising pipeline. Experiments on the DeepFashion MultiModal dataset show that PMMD outperforms…
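
The abstract gives no equations, so the following is only an illustrative sketch of the general idea of conditioning latent tokens on several modalities at once, not the PMMD architecture. It assumes a generic scaled dot-product cross-attention with a residual connection and no learned projections. All function names (`cross_attention`, `fuse_conditions`) and the toy token values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context):
    """Each query token attends over all context tokens.

    Assumption: keys and values are the raw context tokens
    (a real model would apply learned projections first).
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in context]
        weights = softmax(scores)
        attended.append([sum(w * k[j] for w, k in zip(weights, context))
                         for j in range(dim)])
    return attended

def fuse_conditions(latent_tokens, view_tokens, pose_tokens, text_tokens):
    """Hypothetical fusion step: concatenate the condition tokens from all
    modalities, attend to them, and add the result back residually."""
    context = view_tokens + pose_tokens + text_tokens
    attended = cross_attention(latent_tokens, context)
    return [[x + a for x, a in zip(tok, att)]
            for tok, att in zip(latent_tokens, attended)]

# Toy example: two latent tokens, one token per conditioning modality.
latent = [[0.1, 0.2], [0.3, -0.1]]
fused = fuse_conditions(latent,
                        view_tokens=[[1.0, 0.0]],
                        pose_tokens=[[0.0, 1.0]],
                        text_tokens=[[0.5, 0.5]])
```

Concatenating all condition tokens into one context lets every latent token weigh views, pose, and text against each other in a single attention step. This is one common design choice, and the paper's encoder and fusion modules may work differently.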

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Multimodal Machine Learning Applications