Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On
Siqi Wan, Jingwen Chen, Yingwei Pan, Ting Yao, Tao Mei

TL;DR
This paper introduces a novel method for virtual try-on using diffusion models that explicitly incorporate visual correspondence and 3D-aware cues to better preserve garment details and shape during image synthesis.
Contribution
The approach explicitly leverages semantic point matching and 3D cues to improve diffusion-based virtual try-on, achieving state-of-the-art results.
Findings
Strong garment detail preservation demonstrated
State-of-the-art performance on VITON-HD and DressCode datasets
Effective use of semantic point matching and 3D cues
Abstract
Diffusion models have shown preliminary success in the virtual try-on (VTON) task. The typical dual-branch architecture comprises two UNets, one for implicit garment deformation and one for synthesized image generation, and has emerged as the standard recipe for VTON. Nevertheless, preserving the shape and every detail of the given garment remains challenging due to the intrinsic stochasticity of diffusion models. To alleviate this issue, we propose to explicitly capitalize on visual correspondence as a prior to tame the diffusion process, instead of simply feeding the whole garment into the UNet as the appearance reference. Specifically, we interpret the fine-grained appearance and texture details as a set of structured semantic points, and match the semantic points rooted in the garment to the ones over the target person through local flow warping. Such 2D points are then augmented into…
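To make the point-matching step in the abstract more concrete, here is a minimal sketch of how garment points could be carried onto the target person by a dense flow field. It is an illustration only, not the authors' implementation: the function name `match_points_with_flow`, the forward-displacement convention (`flow[y, x]` holds the `(dx, dy)` offset of that garment pixel), and the nearest-pixel lookup are all assumptions, and a single flow field stands in for whatever per-part local flows the method actually uses.

```python
import numpy as np


def match_points_with_flow(points_xy, flow):
    """Move garment points to the person image by following a dense flow field.

    points_xy: (N, 2) array of (x, y) pixel coordinates on the garment image.
    flow:      (H, W, 2) array; flow[y, x] = (dx, dy) is the ASSUMED forward
               displacement of the garment pixel at (x, y).
    Returns an (N, 2) array of matched (x, y) coordinates, clipped to the image.
    """
    h, w = flow.shape[:2]
    pts = np.asarray(points_xy, dtype=np.float64)

    # Nearest-pixel lookup of the displacement at each point (assumption:
    # no sub-pixel interpolation, to keep the sketch short).
    xi = np.clip(np.rint(pts[:, 0]).astype(int), 0, w - 1)
    yi = np.clip(np.rint(pts[:, 1]).astype(int), 0, h - 1)

    matched = pts + flow[yi, xi]

    # Keep matched points inside the image bounds.
    matched[:, 0] = np.clip(matched[:, 0], 0, w - 1)
    matched[:, 1] = np.clip(matched[:, 1], 0, h - 1)
    return matched


# Toy usage: a constant flow that shifts every pixel 2 right and 1 down.
toy_flow = np.zeros((8, 8, 2))
toy_flow[..., 0] = 2.0
toy_flow[..., 1] = 1.0
garment_points = np.array([[1.0, 1.0], [6.0, 3.0]])
person_points = match_points_with_flow(garment_points, toy_flow)
```

In the toy call, the second point would land at x = 8, outside the 8-pixel-wide image, so it is clipped to x = 7. A real pipeline would more likely interpolate the flow bilinearly and discard out-of-bounds matches rather than clip them.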
Peer Reviews
Decision: ICLR 2025 Poster
- The paper leverages semantic point matching as a prior to enhance garment shape and texture preservation.
- Extensive testing on the VITON-HD and DressCode datasets demonstrates the model's robustness and superior performance.
- The authors have provided the code to ensure reproducibility of their results.
- In line 254, local flow warping is used as a method to associate semantic points with their counterparts on the target person. It would be better to provide more detail on the local flow warping process for better understanding.
- For Figure 5, it’s difficult to assess the accuracy of the matched points. Using different colors (e.g. red and green) to illustrate correct and incorrect mappings would improve clarity.
- Adding a figure to illustrate feature injection would help clarify how point
1. By identifying and aligning stable "semantic points" between garment and human images, SPM-Diff effectively reduces randomness in diffusion models, enabling precise and accurate garment reproduction.
2. The incorporation of 3D depth and normal maps enhances realism by accurately controlling garment fit over the body, reflecting considerable thought in modeling garment behavior in 3D space.
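As a rough illustration of the depth and normal cues mentioned above, the sketch below attaches a depth value and a surface normal to each matched 2D point. This excerpt does not say how SPM-Diff actually consumes these cues, so everything here is an assumption made for illustration: the function name `gather_3d_cues`, the `(x, y, depth, nx, ny, nz)` layout, and the nearest-pixel sampling.

```python
import numpy as np


def gather_3d_cues(points_xy, depth, normals):
    """Attach depth and normal values to 2D points (hypothetical helper).

    points_xy: (N, 2) array of (x, y) pixel coordinates.
    depth:     (H, W) depth map.
    normals:   (H, W, 3) surface normal map.
    Returns an (N, 6) array with rows (x, y, depth, nx, ny, nz).
    """
    h, w = depth.shape
    pts = np.asarray(points_xy, dtype=np.float64)

    # Nearest-pixel sampling (assumption: no interpolation).
    xi = np.clip(np.rint(pts[:, 0]).astype(int), 0, w - 1)
    yi = np.clip(np.rint(pts[:, 1]).astype(int), 0, h - 1)

    d = depth[yi, xi][:, None]   # (N, 1) depth at each point
    n = normals[yi, xi]          # (N, 3) normal at each point
    return np.concatenate([pts, d, n], axis=1)


# Toy usage: a 4x4 depth ramp and normals that all face the camera (+z).
toy_depth = np.arange(16, dtype=np.float64).reshape(4, 4)
toy_normals = np.zeros((4, 4, 3))
toy_normals[..., 2] = 1.0
cues = gather_3d_cues(np.array([[1.0, 2.0]]), toy_depth, toy_normals)
```

The sketch also shows the dependence a reviewer raises: every value in the output is read directly from the supplied maps, so errors in the estimated depth or normals propagate straight into the point features.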
1. SPM-Diff's dependence on semantic points may lead to instability when many points are used due to interpolation and projection errors, as discussed regarding point count sensitivity.
2. SPM-Diff relies heavily on accurate depth and normal maps, which may limit its generalization to datasets or images without dependable 3D cues.
3. Although the model effectively preserves garment details, evaluations primarily concentrate on image quality metrics, neglecting user-centered metrics such as per
1. The proposed method is able to generate high-quality virtual try-on results. Experiments show that the proposed method outperforms existing pipelines.
2. The idea of using pairs of semantic points to facilitate the generation process and serve as an extra supervision signal is interesting and reasonable. If robust corresponding points could be found between the garment and the target body, they could be very good priors to boost the diffusion process.
3. The paper is well-writ
The main weakness of the paper is in semantic point matching.
a) How to acquire robust and accurate point matching should be the focus of the paper. However, the paper does not address this problem and instead relies on the local warping method proposed in GP-VTON[1], which is another virtual try-on method. This raises concerns about the contribution.
b) The evaluation of the accuracy of the point-matching method is very limited. The point matching should achieve much more accurate results than the bas
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Face recognition and analysis
Methods: Diffusion · Sparse Evolutionary Training
