TL;DR
This paper introduces TWGTM, a unified diffusion-based framework that jointly tackles garment dressing and undressing, addressing a key gap in virtual try-on and try-off tasks with bidirectional feature disentanglement.
Contribution
It presents the first unified model for both VTON and VTOFF tasks, employing bidirectional feature disentanglement and phased training to handle mask dependency asymmetry.
Findings
Outperforms existing methods on DressCode and VITON-HD datasets.
Effectively bridges the modality gap between mask-guided and mask-free tasks.
Demonstrates high-quality, realistic garment transfer and extraction results.
Abstract
While recent advances in virtual try-on (VTON) have achieved realistic garment transfer to human subjects, its inverse task, virtual try-off (VTOFF), which aims to reconstruct canonical garment templates from dressed humans, remains critically underexplored and lacks systematic investigation. Existing works predominantly treat them as isolated tasks: VTON focuses on garment dressing while VTOFF addresses garment extraction, thereby neglecting their complementary symmetry. To bridge this fundamental gap, we propose the Two-Way Garment Transfer Model (TWGTM), to the best of our knowledge, the first unified framework for joint clothing-centric image synthesis that simultaneously resolves both mask-guided VTON and mask-free VTOFF through bidirectional feature disentanglement. Specifically, our framework employs dual-conditioned guidance from both latent and pixel spaces of reference images…
Peer Reviews
Decision·Submitted to ICLR 2026
[S1] The idea of training a single network for both Virtual Try-Off and Virtual Try-On is good, and experiments confirm that there is a strong relationship between the two tasks. [S2] I also like the use of the CatVTON setup, and the simplicity of swapping the order of the spatial concatenation of model and garment image to achieve the multi-task capability.
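To make the concatenation swap noted in [S2] concrete, here is a minimal sketch. It is not the authors' implementation: the helper names (`concat_width`, `build_input`), the task labels, and the choice of which order goes with which task are all assumptions made for illustration, and plain nested lists stand in for image or latent tensors.

```python
from typing import List

# H x W grid of values; a stand-in for an image or latent tensor (assumption).
Image = List[List[float]]


def concat_width(left: Image, right: Image) -> Image:
    """Concatenate two equally tall grids side by side (along the width axis)."""
    if len(left) != len(right):
        raise ValueError("inputs must have the same height")
    return [row_l + row_r for row_l, row_r in zip(left, right)]


def build_input(person: Image, garment: Image, task: str) -> Image:
    """Hypothetical helper: pick the concatenation order from the task name.

    The mapping of order to task below is an assumption, not taken from the paper.
    """
    if task == "vton":
        return concat_width(person, garment)
    if task == "vtoff":
        return concat_width(garment, person)
    raise ValueError(f"unknown task: {task}")


# Toy 2x2 inputs: 1.0 marks person pixels, 2.0 marks garment pixels.
person = [[1.0, 1.0], [1.0, 1.0]]
garment = [[2.0, 2.0], [2.0, 2.0]]

vton_input = build_input(person, garment, "vton")
vtoff_input = build_input(person, garment, "vtoff")
```

In this sketch both tasks see the same two images and only their left/right placement changes, which is the kind of simplicity the reviewer points to; a real model would apply the same idea to tensors rather than lists.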
[W1] The presentation of the work is poor, and important context about the method seems to be missing. See questions. [W2] The setup is quite complex, with various conditioning features extracted by different, and probably computationally intensive, networks. These choices are ablated, but the result that more capacity improves results is not that interesting. It would be more interesting to ablate the proper input choices (e.g. if the spatial concatenation needs to be reversed, or if all that ca
- First unified diffusion framework that jointly solves VTON and VTOFF in one model.
- Dual-space (latent + pixel) conditioning preserves global structure and fine texture simultaneously.
- Extended attention block enables seamless fusion of semantic and spatial features, boosting both tasks.
- Two-stage training eliminates the mask-dependency gap between masked VTON and mask-free VTOFF.
- Consistent SOTA scores on VITON-HD and DressCode with lower FID, LPIPS and DISTS.
- Mutual reinfo
- Color shifts remain on extreme-white/black garments due to lighting domain gaps.
- Accessories or specular highlights are occasionally misclassified as garment parts, creating artifacts.
- Heavy Transformer-based architecture raises inference cost versus single-task models.
- The paper is well written and easy to follow.
- The experiments demonstrate the effectiveness and competitive performance of the proposed method.
- The parameter similarity comparison in Figure 1a is insufficiently substantiated; there is no clear standard for what constitutes a sufficiently high similarity, and at minimum, a comparison with the base model parameters should be provided.
- The definition and role of the reference image are unclear.
- The performance improvements achieved by the proposed method are marginal.
- The motivation behind the design of the Spatial Refinement Module, as well as the reasoning for each component and
