Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals
Davide Lobba, Fulvio Sanguigni, Bin Ren, Marcella Cornia, Rita Cucchiara, Nicu Sebe

TL;DR
This paper introduces TEMU-VTOFF, a framework for inverse virtual try-on that reconstructs standardized garment images from photos of clothed people, addressing visual ambiguity and loss of fine detail with multimodal attention and a garment alignment module.
Contribution
The paper presents TEMU-VTOFF, the first multi-category inverse virtual try-on model utilizing multimodal attention and structure-texture alignment for improved realism and consistency.
Findings
Achieves state-of-the-art results on VITON-HD and Dress Code datasets.
Significantly improves visual realism and garment consistency.
Effectively resolves visual ambiguities using multimodal cues.
Abstract
Virtual try-on (VTON) has been widely explored for rendering garments onto person images, while its inverse task, virtual try-off (VTOFF), remains largely overlooked. VTOFF aims to recover standardized product images of garments directly from photos of clothed individuals. This capability is of great practical importance for e-commerce platforms, large-scale dataset curation, and the training of foundation models. Unlike VTON, which must handle diverse poses and styles, VTOFF naturally benefits from a consistent output format in the form of flat garment images. However, existing methods face two major limitations: (i) exclusive reliance on visual cues from a single photo often leads to ambiguity, and (ii) generated images usually suffer from loss of fine details, limiting their real-world applicability. To address these challenges, we introduce TEMU-VTOFF, a Text-Enhanced MUlti-category…
Peer Reviews
Decision · ICLR 2026 Poster
[S1] Good quantitative and qualitative results.
[S2] A good ablation study justifying most of the design choices.
[S3] Well written and easy to follow.
[W1] A3 section of the appendix suggests that the garment captions are based on textual descriptions of the e-commerce garment image. This seems like a fundamental flaw as the original garment caption is not going to exist for samples in the wild where the ground truth is not going to be known. This presents information about test directly seeping into the inference process.
[W2] Some unclear implementation details. See questions.
[W3] A couple of additional ablations would be useful. e.g. Vel
- Purpose-built architecture for try-off instead of reversing VTON pipelines, enabling clean reconstruction across multiple garment categories (upper / lower / full-body).
- Multimodal hybrid attention improves disambiguation and detail preservation by combining visual features with textual descriptions.
- High image fidelity and alignment thanks to the garment aligner module, resulting in superior quality and consistency compared to existing methods.
- Your attempt to explore a new direction within the VITON domain is impressive. However, while VITON-HD uses full-body datasets, this paper uses datasets without faces. Is this because including faces would cause errors?
- Would VTOFF also work on more limited imagery such as VITON-CROP [1]? Since this work deals with real-world scenarios, I recommend including [1] in the references.
- It would also be helpful if the ablation study section were organized in a more intuitive manner.
[1] Kang,
- The task definition has practical value. Inverse try-on is useful for catalog data creation and dataset enhancement. Multi-category handling in a single pipeline is appealing.
- The dual DiT setup and multimodal hybrid attention integrate signals from image, text, and mask in a straightforward and scalable way.
- Solid results on Dress Code. The method shows consistent improvements on distributional and perceptual metrics, with ablations that isolate design choices.
(1) Mixed gains on VITON-HD. On VITON-HD the improvements are minor or mixed. For example, LPIPS is **22.50** for One Model for All vs **28.44** for TEMU-VTOFF (LPIPS lower is better, so this favors the baseline), while DISTS is **19.20** vs **18.04** (lower is better, so this favors TEMU-VTOFF). This suggests the gains are not uniform across metrics or categories. A deeper per-category analysis is needed.
(2) Metric suitability and tradeoffs. The ablations imply the garment alignment module ca
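The comparison in point (1) can be checked with a short sketch. It uses only the four VITON-HD scores quoted in that review and assumes, as the reviewer states, that lower is better for both LPIPS and DISTS. The names `scores` and `compare` are illustrative and do not come from the paper or its code.

```python
# VITON-HD scores as quoted in the review; lower is better for both metrics.
scores = {
    "LPIPS": {"One Model for All": 22.50, "TEMU-VTOFF": 28.44},
    "DISTS": {"One Model for All": 19.20, "TEMU-VTOFF": 18.04},
}


def compare(scores):
    """For each metric, return the method with the lower score and the gap."""
    result = {}
    for metric, by_method in scores.items():
        best = min(by_method, key=by_method.get)
        worst = max(by_method, key=by_method.get)
        gap = round(by_method[worst] - by_method[best], 2)
        result[metric] = (best, gap)
    return result


summary = compare(scores)
for metric, (best, gap) in summary.items():
    print(f"{metric}: {best} is better by {gap}")
```

On these numbers the two metrics disagree: the baseline leads on LPIPS by a wider margin than TEMU-VTOFF leads on DISTS, which is the mixed picture the reviewer describes.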
Taxonomy
Topics: Fashion and Cultural Textiles · Color perception and design
