FashionSD-X: Multimodal Fashion Garment Synthesis using Latent Diffusion
Abhishek Kumar Singh, Ioannis Patras

TL;DR
FashionSD-X introduces a multimodal latent diffusion model that generates high-quality fashion images from text and sketches, enhancing design workflows with improved realism and control.
Contribution
This paper presents a generative pipeline that combines ControlNet conditioning with LoRA fine-tuning of a latent diffusion model for multimodal fashion image synthesis, and reports better results than standard Stable Diffusion baselines.
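To make the LoRA part of the contribution concrete, the sketch below shows the general low-rank adaptation idea on a single linear layer: the pretrained weight stays frozen and only two small matrices are trained. This is a minimal illustration in NumPy, not the authors' code; the layer sizes, rank, scaling factor, and the function name lora_forward are all assumptions chosen for the example.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha):
    """Linear layer with a LoRA update: y = x W^T + (alpha / r) * x A^T B^T.

    W is the frozen pretrained weight; A (r x d_in) and B (d_out x r)
    are the only trainable parameters.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


# Hypothetical sizes, for illustration only.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 32, 4, 8.0

W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                 # trainable up-projection, zero-initialised
x = rng.normal(size=(8, d_in))           # a batch of 8 input vectors

# With B = 0 the adapted layer reproduces the pretrained layer exactly.
y_init = lora_forward(x, W, A, B, alpha)

# After training, the update can be merged into a single weight matrix.
B_trained = rng.normal(size=(d_out, r))  # stand-in for learned values
W_merged = W + (alpha / r) * B_trained @ A
y_adapted = lora_forward(x, W, A, B_trained, alpha)

full_params = W.size                     # parameters in the full weight
lora_params = A.size + B.size            # parameters LoRA actually trains
```

In this toy setting LoRA trains 384 parameters instead of 2048, which is the property that makes fine-tuning a large diffusion backbone on a fashion dataset comparatively cheap.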
Findings
Reports significantly better FID, CLIP Score, and KID than baseline Stable Diffusion models.
Integrating sketch data improves the realism of generated fashion images.
Demonstrates potential for interactive and personalized fashion design applications.
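Of the metrics named in the findings, KID (Kernel Inception Distance) is the easiest to sketch compactly: it is an unbiased estimate of the squared maximum mean discrepancy between real and generated image features under a cubic polynomial kernel. The example below is a rough sketch under stated assumptions, not the evaluation code used in the paper: random vectors stand in for Inception features, the estimate uses a single subset rather than an average over many subsets, and the function names are hypothetical.

```python
import numpy as np


def polynomial_kernel(X, Y):
    """Cubic polynomial kernel k(x, y) = (x . y / d + 1)^3 used by KID."""
    d = X.shape[1]
    return (X @ Y.T / d + 1.0) ** 3


def kid(real_feats, gen_feats):
    """Unbiased MMD^2 estimate between two feature sets (single subset)."""
    m, n = len(real_feats), len(gen_feats)
    k_rr = polynomial_kernel(real_feats, real_feats)
    k_gg = polynomial_kernel(gen_feats, gen_feats)
    k_rg = polynomial_kernel(real_feats, gen_feats)
    # Within-set terms exclude the diagonal so the estimator is unbiased.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_gg = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
    term_rg = k_rg.mean()
    return term_rr + term_gg - 2.0 * term_rg


# Stand-in features; a real evaluation would use Inception activations.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 16))
gen_close = rng.normal(size=(200, 16))           # same distribution as real
gen_far = rng.normal(size=(200, 16)) + 1.0       # shifted distribution

kid_close = kid(real, gen_close)
kid_far = kid(real, gen_far)
```

Lower is better: features drawn from the same distribution give a value near zero (it can be slightly negative because the estimator is unbiased), while the shifted set gives a clearly larger value.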
Abstract
The rapid evolution of the fashion industry increasingly intersects with technological advancements, particularly through the integration of generative AI. This study introduces a novel generative pipeline designed to transform the fashion design process by employing latent diffusion models. Utilizing ControlNet and LoRA fine-tuning, our approach generates high-quality images from multimodal inputs such as text and sketches. We leverage and enhance state-of-the-art virtual try-on datasets, including Multimodal Dress Code and VITON-HD, by integrating sketch data. Our evaluation, utilizing metrics like FID, CLIP Score, and KID, demonstrates that our model significantly outperforms traditional stable diffusion models. The results not only highlight the effectiveness of our model in generating fashion-appropriate outputs but also underscore the potential of diffusion models in…
Taxonomy
Topics: 3D Shape Modeling and Analysis · Human Motion and Animation · Fashion and Cultural Textiles
Methods: Contrastive Language-Image Pre-training · Diffusion
