DiffusionTrend: A Minimalist Approach to Virtual Fashion Try-On

Wengyi Zhan; Mingbao Lin; Shuicheng Yan; Rongrong Ji

arXiv:2412.14465 · cs.CV · June 2, 2025

TL;DR

DiffusionTrend introduces a resource-efficient, training-free diffusion model approach for virtual fashion try-on, leveraging latent information and a lightweight CNN to produce visually compelling results without retraining or complex inputs.

Contribution

It presents a novel method that avoids retraining diffusion models and simplifies user inputs, advancing virtual try-on technology with a training-free diffusion approach.

Findings

1. Achieves visually compelling virtual try-on results.
2. Avoids resource-intensive diffusion model retraining.
3. Simplifies the input requirements for virtual try-on.

Abstract

We introduce DiffusionTrend for virtual fashion try-on, which forgoes the need for retraining diffusion models. Using advanced diffusion models, DiffusionTrend harnesses latent information rich in prior information to capture the nuances of garment details. Throughout the diffusion denoising process, these details are seamlessly integrated into the model image generation, expertly directed by a precise garment mask crafted by a lightweight and compact CNN. Although our DiffusionTrend model initially demonstrates suboptimal metric performance, our exploratory approach offers some important advantages: (1) It circumvents resource-intensive retraining of diffusion models on large datasets. (2) It eliminates the necessity for various complex and user-unfriendly model inputs. (3) It delivers a visually compelling try-on experience, underscoring the potential of training-free diffusion model.…
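The abstract says garment details are integrated into the model image during denoising under the direction of a garment mask, but it does not give the exact rule. One common way to realize this kind of mask-guided integration is to blend two latents elementwise, sketched below. The blending formula, the function name `blend_latents`, and the latent shapes are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def blend_latents(person_latent, garment_latent, mask):
    """Hypothetical helper: take garment latents where mask == 1, person latents elsewhere.

    This elementwise blend is an assumed stand-in for the mask-directed
    integration described in the abstract.
    """
    mask = mask.astype(person_latent.dtype)
    return mask * garment_latent + (1.0 - mask) * person_latent


# Toy latents with an assumed shape of (channels, height, width).
rng = np.random.default_rng(0)
person = rng.standard_normal((4, 8, 8))
garment = rng.standard_normal((4, 8, 8))

# Binary garment mask; the single channel broadcasts over all latent channels.
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0

blended = blend_latents(person, garment, mask)
```

In a real pipeline a blend like this would presumably be applied at one or more denoising steps, with the mask coming from the lightweight CNN rather than being hand-written as it is here.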

Peer Reviews

Decision: ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 4

Strengths

Simple and efficient method. Mixes image-space deformation with neural generation, which is refreshing. Training-free method.

Weaknesses

Results are underwhelming. Limited innovation. No user study. Only handles simple cases with simple backgrounds and lighting.

Reviewer 02 · Rating 1 · Confidence 4

Strengths

The strength of this paper lies in their argument for a resource constrained setup to achieve virtual try-on.

Weaknesses

In my humble opinion, the biggest drawback lies in the assumption this paper makes about not using accurate off-the-shelf segmentation methods. The SDXL base model, which is likely used in this paper, has around 3.5 billion parameters, while SAM-B [1] has around 94.7 million parameters and SAM2 [2] Hiera Tiny has around 38.9 million parameters. These segmentation methods are significantly less computationally intensive than the base diffusion model. In my understanding of the proposal here, the segmen

Reviewer 03 · Rating 5 · Confidence 5

Strengths

1. The paper addresses the significant training cost issue of existing virtual try-on models and investigates low-cost methods for conducting try-ons, a valuable subject. The authors explore a viable approach to perform try-ons through the replacement of latents. Unfortunately, the method seems somewhat rudimentary, and there is room for improvement in the quality of the results. 2. The paper is well-articulated, and the data presented is credible.

Weaknesses

1. The statement in line 161 "Traditional segmentation models (He et al., 2017; Kirillov et al., 2023) are typically employed to generate masks. However, in environments where numerous users engage in online consultations simultaneously, computational efficiency becomes crucial. For instance, Segment Anything (Kirillov et al., 2023), while effective, incurs a significant computational cost, which can be impractical for real-world applications, especially when scalability and cost-effectiveness

Reviewer 04 · Rating 3 · Confidence 5

Strengths

(+) The writing is easy to follow.

Weaknesses

(-) The motivations are flawed. i) Most SOTA methods only fine-tune the diffusion models and are not too computationally expensive. ii) The densepose/segment map/clothes-agnostic representations/keypoints are not additional inputs, but intermediate results from existing tools, and therefore do not complicate use at all. iii) There are many existing human parsing models (no need to use SegmentAnything) that are powerful and lightweight, which can be obtained by googling the keyword "human pars

Taxonomy

Topics: 3D Shape Modeling and Analysis · Fashion and Cultural Textiles · Human Motion and Animation

Methods: Diffusion