GarmentPainter: Efficient 3D Garment Texture Synthesis with Character-Guided Diffusion Model
Jinbo Wu, Xiaobo Gao, Xing Liu, Chen Zhao, Jialun Liu

TL;DR
GarmentPainter is a novel framework that efficiently synthesizes high-quality, 3D-consistent garment textures in UV space, leveraging character-guided diffusion models for improved control and scalability.
Contribution
It introduces a UV-based guidance method and a type selection module for fine-grained, character-aware garment texture synthesis without requiring mesh-image alignment.
Findings
Achieves state-of-the-art visual fidelity and 3D consistency.
Demonstrates high computational efficiency compared to prior methods.
Provides flexible, component-specific texture generation without strict alignment.
Abstract
Generating high-fidelity, 3D-consistent garment textures remains a challenging problem due to the inherent complexities of garment structures and the stringent requirement for detailed, globally consistent texture synthesis. Existing approaches either rely on 2D-based diffusion models, which inherently struggle with 3D consistency, require expensive multi-step optimization, or depend on strict spatial alignment between 2D reference images and 3D meshes, which limits their flexibility and scalability. In this work, we introduce GarmentPainter, a simple yet efficient framework for synthesizing high-quality, 3D-aware garment textures in UV space. Our method leverages a UV position map as the 3D structural guidance, ensuring texture consistency across the garment surface during texture generation. To enhance control and adaptability, we introduce a type selection module, enabling…
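The abstract names a UV position map as the 3D structural guidance. As a rough illustration of the idea (not the authors' implementation), the sketch below stores each vertex's normalized 3D position at its UV pixel. The function name `uv_position_map`, the `size` parameter, and the per-vertex scatter are all assumptions made for this example; a production pipeline would typically rasterize triangles and interpolate positions across each face rather than writing only vertex pixels.

```python
import numpy as np

def uv_position_map(vertices, uvs, size=64):
    """Scatter normalized 3D vertex positions into a UV-space image.

    vertices: (N, 3) array of 3D positions.
    uvs:      (N, 2) array of UV coordinates in [0, 1].
    Returns (pos_map, mask): a (size, size, 3) float map and a (size, size) bool mask.
    Hypothetical helper: only vertex pixels are filled, faces are not rasterized.
    """
    vertices = np.asarray(vertices, dtype=np.float64)
    uvs = np.asarray(uvs, dtype=np.float64)

    # Normalize positions to [0, 1] per axis so they can be stored like RGB.
    lo = vertices.min(axis=0)
    span = np.maximum(vertices.max(axis=0) - lo, 1e-8)
    normalized = (vertices - lo) / span

    # Convert UVs to integer pixel indices; v is flipped so row 0 is the top.
    cols = np.clip(np.round(uvs[:, 0] * (size - 1)).astype(int), 0, size - 1)
    rows = np.clip(np.round((1.0 - uvs[:, 1]) * (size - 1)).astype(int), 0, size - 1)

    pos_map = np.zeros((size, size, 3))
    mask = np.zeros((size, size), dtype=bool)
    pos_map[rows, cols] = normalized  # each texel now encodes where it sits in 3D
    mask[rows, cols] = True           # marks texels covered by the garment
    return pos_map, mask
```

Because every texel carries its 3D location, texels that are far apart in UV space but adjacent on the surface receive similar position values, which is the kind of signal a UV-space generator could use to keep textures consistent across seams.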
Peer Reviews
Decision: Submitted to ICLR 2026
1. Simple and practical design: Minimal modifications to a standard inpainting diffusion backbone (channel concatenation + type embedding) make the method easy to implement and deploy.
2. Fast inference: Reported end-to-end UV generation is notably fast (single forward pass), which is attractive for production pipelines compared with multi-view/iterative methods.
3. Workflow alignment: Accepting a person-in-context reference image maps well to real authoring scenarios, reducing pre- and post-p
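The "channel concatenation + type embedding" design mentioned in the first point can be pictured with a small sketch. Everything below is an assumption for illustration: the function names, the channel counts (4-channel latents and a 1-channel mask, following the usual Stable Diffusion inpainting layout), and the embedding table are hypothetical and are not taken from the paper. Only the three garment categories (top, bottom, one-piece) come from the dataset description quoted in the reviews.

```python
import numpy as np

# Categories named in the dataset description; the ordering is an assumption.
GARMENT_TYPES = ("top", "bottom", "one-piece")

def build_unet_input(noisy_latent, masked_latent, mask, position_latent):
    """Concatenate conditioning tensors along the channel axis (C, H, W layout)."""
    parts = [noisy_latent, masked_latent, mask, position_latent]
    spatial = {p.shape[1:] for p in parts}
    if len(spatial) != 1:
        raise ValueError(f"spatial sizes differ: {spatial}")
    return np.concatenate(parts, axis=0)

def type_embedding(garment_type, table):
    """Look up the embedding row for the selected garment type.

    table: (len(GARMENT_TYPES), D) array standing in for learned weights.
    """
    return table[GARMENT_TYPES.index(garment_type)]

# Usage with placeholder data: 4 + 4 + 1 + 4 = 13 input channels.
rng = np.random.default_rng(0)
h = w = 8
unet_input = build_unet_input(
    rng.normal(size=(4, h, w)),  # noisy UV texture latent
    rng.normal(size=(4, h, w)),  # latent of the masked UV texture
    np.ones((1, h, w)),          # inpainting mask
    rng.normal(size=(4, h, w)),  # latent of the UV position map
)
```

The appeal of this style of conditioning is that only the first convolution of the backbone needs to accept extra channels, which is consistent with the reviewer's remark that the modifications are minimal.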
1. Fairness & reproducibility: Different baselines appear to be run under different input protocols (e.g., prompts, masks, background handling, illumination). This can bias comparisons. A single unified evaluation protocol (resolution, masking, prompts, backgrounds/lighting) and a reproducible package would strengthen claims.
2. 3D consistency metrics are thin: The evaluation focuses on image-space metrics (e.g., FID/KID) and runtime. It lacks direct measures of UV seam continuity, cross-view c
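For readers unfamiliar with the image-space metrics named in the second point, the following is a minimal sketch of the Kernel Inception Distance (KID): an unbiased estimate of the squared maximum mean discrepancy between two feature sets under the cubic polynomial kernel k(x, y) = (x·y / d + 1)^3. The function names are illustrative. In practice the features come from an Inception network and the estimate is averaged over several subsets; here the inputs are simply assumed to be precomputed feature arrays.

```python
import numpy as np

def polynomial_kernel(a, b):
    """Cubic polynomial kernel used by KID: (a.b / d + 1) ** 3."""
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3

def kid(real_feats, fake_feats):
    """Unbiased MMD^2 estimate between two (num_samples, d) feature arrays."""
    m, n = len(real_feats), len(fake_feats)
    k_rr = polynomial_kernel(real_feats, real_feats)
    k_ff = polynomial_kernel(fake_feats, fake_feats)
    k_rf = polynomial_kernel(real_feats, fake_feats)
    # Within-set terms exclude the diagonal (self-similarity) to stay unbiased.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    return term_rr + term_ff - 2.0 * k_rf.mean()
```

A score like this compares feature distributions of rendered or UV images only, so it cannot by itself detect seam discontinuities or cross-view inconsistencies, which is the gap the reviewer points to.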
* 3D garment generation is an important problem for industry applications.
* The proposed garment dataset with high-quality garments should be useful to the 3D community.
* The proposed algorithm obtains promising results with sufficient ablation studies.
* More implementation details should be provided to facilitate reproduction of the paper. Better still, the code could be released.
* For the experimental results in Table 1, I would suggest adding more comparisons against papers published in the last two years, particularly in 2025.
* In the experiments, the evaluation metrics are FID and KID, which may not be consistent with human subjective evaluations. Thus, is it possible to provide a user study to
- **Data Contribution**: The authors curate a garment-specific dataset with UV maps, reference images, and mask/position data, which is valuable for this niche area of 3D garment texturing.
- **Structural Innovation on SD1.5**: The way the authors adapt SD1.5, particularly replacing text cross-attention with multi-modal VAE latent conditioning, is a novel and neat architectural modification that simplifies conditioning without heavy architectural changes.
- **Experimental Soundness**: The abla
**Concerns on Generalization**
> *[Sec.3 L206-209]* “Ultimately, we curate a dataset comprising 7,579 clothing items, including 3,703 tops, 2,114 bottoms, and 1,762 one-piece garments.”
- Although the dataset creation is commendable, the total scale (~7.6k) appears small relative to the architectural modifications made to SD1.5 (multi-modal VAE inputs, removal of text cross-attention). It raises the question of whether such a limited dataset is sufficient to grant **true generalization** rather
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Computer Graphics and Visualization Techniques
