DS-VTON: An Enhanced Dual-Scale Coarse-to-Fine Framework for Virtual Try-On
Xianbing Sun, Yan Hong, Jiahui Zhan, Jun Lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang

TL;DR
DS-VTON introduces a dual-scale coarse-to-fine framework for virtual try-on that improves garment alignment and texture preservation without relying on segmentation masks, achieving state-of-the-art results.
Contribution
The paper presents a novel dual-stage approach combining low-res structural alignment with high-res texture refinement using diffusion processes, eliminating the need for human parsing maps.
Findings
Outperforms previous methods in alignment accuracy
Enhances texture detail preservation in try-on results
Operates without segmentation masks
Abstract
Despite recent progress, most existing virtual try-on methods still struggle to simultaneously address two core challenges: accurately aligning the garment image with the target human body, and preserving fine-grained garment textures and patterns. These two requirements map directly onto a coarse-to-fine generation paradigm, where the coarse stage handles structural alignment and the fine stage recovers rich garment details. Motivated by this observation, we propose DS-VTON, an enhanced dual-scale coarse-to-fine framework that tackles the try-on problem more effectively. DS-VTON consists of two stages: the first stage generates a low-resolution try-on result to capture the semantic correspondence between garment and body, where reduced detail facilitates robust structural alignment. In the second stage, a blend-refine diffusion process reconstructs high-resolution outputs by refining…
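The two-stage data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `coarse_stage` and `refine_stage` are hypothetical callables standing in for the two diffusion models, and `factor`, `blend`, and the linear mix with Gaussian noise are illustrative choices that do not come from the paper. Images are plain 2D grids of floats to keep the example dependency-free.

```python
import random


def downsample(img, factor):
    """Average-pool a 2D grid (list of rows) by an integer factor."""
    h, w = len(img), len(img[0])
    if h % factor or w % factor:
        raise ValueError("image size must be divisible by factor")
    area = factor * factor
    return [
        [
            sum(img[y * factor + dy][x * factor + dx]
                for dy in range(factor) for dx in range(factor)) / area
            for x in range(w // factor)
        ]
        for y in range(h // factor)
    ]


def upsample(img, factor):
    """Nearest-neighbour upsample a 2D grid by an integer factor."""
    return [
        [value for value in row for _ in range(factor)]
        for row in img
        for _ in range(factor)
    ]


def dual_scale_try_on(person, garment, coarse_stage, refine_stage,
                      factor=2, blend=0.5, seed=0):
    """Run a coarse low-resolution pass, then a high-resolution refinement."""
    rng = random.Random(seed)
    # Stage 1 (assumed interface): align garment and body at low resolution.
    coarse = coarse_stage(downsample(person, factor),
                          downsample(garment, factor))
    # Stage 2 (assumed interface): start refinement from the upsampled coarse
    # result mixed with noise; `blend` is a hypothetical mixing weight.
    coarse_up = upsample(coarse, factor)
    start = [
        [(1.0 - blend) * v + blend * rng.gauss(0.0, 1.0) for v in row]
        for row in coarse_up
    ]
    return refine_stage(start, person, garment)


# Stand-in stages so the sketch runs end to end.
def average_stage(p, g):
    return [[0.5 * (a + b) for a, b in zip(rp, rg)] for rp, rg in zip(p, g)]


def identity_stage(x, p, g):
    return x


person = [[1.0] * 4 for _ in range(4)]
garment = [[0.0] * 4 for _ in range(4)]
result = dual_scale_try_on(person, garment, average_stage, identity_stage,
                           factor=2, blend=0.0)
```

With `blend=0.0` the refinement stage receives the upsampled coarse result unchanged, which makes the hand-off between the two scales easy to inspect; a real system would replace both stand-in stages with trained models.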
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. The writing of this paper is easy to follow. The motivation is well clarified, and the proposed method is easy to understand. 2. The quantitative and qualitative comparisons with state-of-the-art methods on two public virtual try-on datasets demonstrate the effectiveness of the proposed method.
1. One of the major problems of this paper lies in the novelty of the proposed DS-VTON method. The idea of first generating low-resolution images and then transforming them into a high-resolution version has already been widely explored in the field of high-resolution image generation. The multi-scale latent upsampling technique used in [1][2][3] is quite similar to the dual-scale DS-VTON method. Could the authors make a comparison with these approaches to elaborate more clearly on their technica
- The overall proposed method is simple and straightforward. - The proposed method does not require a human parsing mask, which makes it easier to deploy in real-world use. - The empirical results compare well against other existing methods.
- The overall novelty of the method is limited, since multi-scale image processing has been studied for a very long time. The main technical difference of this method is its use of two different diffusion models to handle input images that capture content at two different scales. - The justification and evaluation of the use of human parsing learned at the diffusion model are not demonstrated in the paper. - The necessity of the Blend-refine diffusion reformulation is doubtful. It is recommended
Strength: 1. The coarse-to-fine framework is interesting and effective. 2. The experiments are sufficient and the experimental results are excellent.
Weakness: 1. The overall approach is not innovative and is similar to IMAGDressing and MagicCloth. 2. The results in Table 1 are quite different from those of FiTDiT. 3. The generated results of the anime character in Figure 8 have obvious defects, such as the blue long sleeves. 4. The work based on SD is slightly behind, and experiments based on DiT or Flux may be a better choice.
- 1. This paper proposes a novel dual-scale, mask-free framework that enhances the coarse-to-fine process and is particularly well-suited for the try-on task. - 2. The mask-free formulation is a clear practical advantage: eliminating dependence on potentially brittle human-parsing or segmentation modules improves robustness and simplifies deployment. - 3. The blend-refine diffusion re-formulation is novel and well-motivated; explicitly bridging low- and high-resolution distributions with a tuna
1. Training data are synthetically amplified with IDM-VTON generations; although the paper acknowledges the risk, visible entanglement still occurs—hair, accessories, or background sometimes change, indicating less-than-perfect disentanglement that could undermine identity preservation in real applications. 2. The low-resolution stage is constrained to a fixed σ = 2; no adaptive or content-aware schedule is explored, so structural detail can be lost for unusually complex garments, and the choic
Taxonomy
Topics: Image Processing Techniques and Applications · Computer Graphics and Visualization Techniques · Cell Image Analysis Techniques
