DiffusionTrend: A Minimalist Approach to Virtual Fashion Try-On
Wengyi Zhan, Mingbao Lin, Shuicheng Yan, Rongrong Ji

TL;DR
DiffusionTrend introduces a resource-efficient, training-free diffusion model approach for virtual fashion try-on, leveraging latent information and a lightweight CNN to produce visually compelling results without retraining or complex inputs.
Contribution
It presents a novel method that avoids retraining diffusion models and simplifies user inputs, advancing virtual try-on technology with a training-free diffusion approach.
Findings
Achieves visually compelling virtual try-on results.
Avoids resource-intensive diffusion model retraining.
Simplifies the input requirements for virtual try-on.
Abstract
We introduce DiffusionTrend for virtual fashion try-on, which forgoes the need for retraining diffusion models. Using advanced diffusion models, DiffusionTrend harnesses latent information rich in prior information to capture the nuances of garment details. Throughout the diffusion denoising process, these details are seamlessly integrated into the model image generation, expertly directed by a precise garment mask crafted by a lightweight and compact CNN. Although our DiffusionTrend model initially demonstrates suboptimal metric performance, our exploratory approach offers some important advantages: (1) It circumvents resource-intensive retraining of diffusion models on large datasets. (2) It eliminates the necessity for various complex and user-unfriendly model inputs. (3) It delivers a visually compelling try-on experience, underscoring the potential of training-free diffusion models.…
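The mask-guided integration described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the denoiser, the timestep schedule, or when blending stops, so `toy_denoise_step`, `blend_until`, and all function names below are hypothetical placeholders. The sketch only shows the general idea of copying garment latents into the model-image latents inside a mask during the first few denoising steps.

```python
import numpy as np


def blend_latents(model_latent, garment_latent, mask):
    """Take garment latents where mask == 1 and model latents elsewhere."""
    mask = mask.astype(model_latent.dtype)
    return mask * garment_latent + (1.0 - mask) * model_latent


def toy_denoise_step(latent, strength=0.1):
    """Hypothetical stand-in for one step of a real diffusion denoiser."""
    return latent * (1.0 - strength)


def try_on_sketch(model_latent, garment_latent, mask, num_steps=4, blend_until=2):
    """Denoise both latents in lockstep, blending during the early steps.

    `blend_until` is an assumed hyperparameter: blending is applied only for
    steps with index below it, after which the model latent evolves alone.
    """
    z_model = model_latent.copy()
    z_garment = garment_latent.copy()
    for step in range(num_steps):
        if step < blend_until:
            z_model = blend_latents(z_model, z_garment, mask)
        z_model = toy_denoise_step(z_model)
        z_garment = toy_denoise_step(z_garment)
    return z_model


# Toy example: 4-channel 8x8 latents and a square garment mask.
model = np.zeros((4, 8, 8))
garment = np.ones((4, 8, 8))
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0  # region a mask predictor might mark as garment

blended = blend_latents(model, garment, mask)
result = try_on_sketch(model, garment, mask)
```

In this toy setup the mask is a fixed square; in the paper it is produced by a lightweight CNN, which the sketch does not attempt to reproduce.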
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
Simple and efficient method. Mixes image-space deformation with neural generation, which is refreshing. Training-free method.
Results are underwhelming. Limited innovation. No user study. Can handle only simple cases with simple background and lighting.
The strength of this paper lies in their argument for a resource constrained setup to achieve virtual try-on.
In my humble opinion, the biggest drawback lies in the assumption this paper makes about not using accurate off-the-shelf segmentation methods. The SDXL base model, which is likely used in this paper, has around 3.5 billion parameters, while SAM-B [1] has around 94.7 million parameters and SAM2 [2] Hiera Tiny has around 38.9 million parameters. These segmentation methods are significantly less computationally intensive than the base diffusion model. In my understanding of the proposal here, the segmen
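The parameter counts quoted in this review can be put side by side with a short calculation. This is only a rough sketch: it uses the approximate figures the reviewer gives and treats parameter count as a proxy for compute, which is an assumption, since actual inference cost also depends on resolution and the number of denoising steps.

```python
# Approximate parameter counts as quoted in the review above.
PARAMS = {
    "SDXL base": 3.5e9,
    "SAM-B": 94.7e6,
    "SAM2 Hiera Tiny": 38.9e6,
}


def size_ratio(reference, other, params=PARAMS):
    """Return how many times larger `reference` is than `other`."""
    return params[reference] / params[other]


ratios = {
    name: size_ratio("SDXL base", name)
    for name in PARAMS
    if name != "SDXL base"
}

for name, ratio in ratios.items():
    print(f"SDXL base has about {ratio:.0f}x the parameters of {name}")
```

On these figures the diffusion backbone is roughly 37 times the size of SAM-B and roughly 90 times the size of SAM2 Hiera Tiny.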
1. The paper addresses the significant training cost issue of existing virtual try-on models and investigates low-cost methods for conducting try-ons, a valuable subject. The authors explore a viable approach to perform try-ons through the replacement of latents. Unfortunately, the method seems somewhat rudimentary, and there is room for improvement in the quality of the results. 2. The paper is well-articulated, and the data presented is credible.
1. The statement in line 161 "Traditional segmentation models (He et al., 2017; Kirillov et al., 2023) are typically employed to generate masks. However, in environments where numerous users engage in online consultations simultaneously, computational efficiency becomes crucial. For instance, Segment Anything (Kirillov et al., 2023), while effective, incurs a significant computational cost, which can be impractical for real-world applications, especially when scalability and cost-effectiveness
(+) The writing is easy to follow.
(-) The motivations are defective. i) Most SOTA methods only fine-tune the diffusion models and are not too computationally expensive. ii) The densepose/segment map/clothes-agnostic representations/keypoints are not additional inputs, but intermediate results from existing tools, and therefore do not complicate use at all. iii) There are many existing human parsing models (no need to use SegmentAnything) that are powerful and lightweight, which can be obtained by googling the keyword "human pars
Taxonomy
Topics: 3D Shape Modeling and Analysis · Fashion and Cultural Textiles · Human Motion and Animation
Methods: Diffusion
