Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals

Davide Lobba; Fulvio Sanguigni; Bin Ren; Marcella Cornia; Rita Cucchiara; Nicu Sebe

arXiv:2505.21062 · cs.CV · February 24, 2026

TL;DR

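This paper introduces TEMU-VTOFF, a framework for inverse virtual try-on (virtual try-off, VTOFF) that reconstructs standardized garment images from photos of clothed people. It targets two weaknesses of earlier methods: ambiguity from relying on a single photo, handled with multimodal (text-enhanced) attention, and loss of fine detail, handled with a garment alignment module.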

Contribution

The paper presents TEMU-VTOFF, the first multi-category inverse virtual try-on model utilizing multimodal attention and structure-texture alignment for improved realism and consistency.

Findings

01 Achieves state-of-the-art results on VITON-HD and Dress Code datasets.

02 Significantly improves visual realism and garment consistency.

03 Effectively resolves visual ambiguities using multimodal cues.

Abstract

Virtual try-on (VTON) has been widely explored for rendering garments onto person images, while its inverse task, virtual try-off (VTOFF), remains largely overlooked. VTOFF aims to recover standardized product images of garments directly from photos of clothed individuals. This capability is of great practical importance for e-commerce platforms, large-scale dataset curation, and the training of foundation models. Unlike VTON, which must handle diverse poses and styles, VTOFF naturally benefits from a consistent output format in the form of flat garment images. However, existing methods face two major limitations: (i) exclusive reliance on visual cues from a single photo often leads to ambiguity, and (ii) generated images usually suffer from loss of fine details, limiting their real-world applicability. To address these challenges, we introduce TEMU-VTOFF, a Text-Enhanced MUlti-category…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

[S1] Good quantitative and qualitative results.
[S2] A good ablation study justifying most of the design choices.
[S3] Well written and easy to follow.

Weaknesses

[W1] Section A3 of the appendix suggests that the garment captions are based on textual descriptions of the e-commerce garment image. This seems like a fundamental flaw: no such caption will exist for samples in the wild, where the ground truth is unknown. As described, information about the test target leaks directly into the inference process.
[W2] Some implementation details are unclear. See questions.
[W3] A couple of additional ablations would be useful. e.g. Vel

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Purpose-built architecture for try-off instead of reversing VTON pipelines, enabling clean reconstruction across multiple garment categories (upper / lower / full-body).
- Multimodal hybrid attention improves disambiguation and detail preservation by combining visual features with textual descriptions.
- High image fidelity and alignment thanks to the garment aligner module, resulting in superior quality and consistency compared to existing methods.

Weaknesses

- Your attempt to explore a new direction within the VITON domain is impressive. However, while VITON-HD uses full-body datasets, this paper uses datasets without faces. Is this because including faces would cause errors?
- Would VTOFF also work on more limited imagery such as VITON-CROP [1]? Since this work deals with real-world scenarios, I recommend including [1] in the references.
- It would also be helpful if the ablation study section were organized in a more intuitive manner.

[1] Kang,

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The task definition has practical value. Inverse try-on is useful for catalog data creation and dataset enhancement. Multi-category handling in a single pipeline is appealing.
- The dual DiT setup and multimodal hybrid attention integrate signals from image, text, and mask in a straightforward and scalable way.
- Solid results on Dress Code. The method shows consistent improvements on distributional and perceptual metrics, with ablations that isolate design choices.

Weaknesses

(1) Mixed gains on VITON-HD. The improvements on VITON-HD are minor or mixed. For example, LPIPS is **22.50** for One Model for All vs **28.44** for TEMU-VTOFF (LPIPS lower is better, so this favors the baseline), while DISTS is **19.20** vs **18.04** (lower is better, so this favors TEMU-VTOFF). This suggests the gains are not uniform across metrics or categories. A deeper per-category analysis is needed.
(2) Metric suitability and tradeoffs. The ablations imply the garment alignment module ca
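
The comparison in point (1) can be checked with a few lines of Python. This is only an illustrative sketch: the four scores are the ones quoted by the reviewer, both metrics are treated as lower-is-better as the reviewer states, and the names `scores` and `compare` are hypothetical helpers, not part of the paper's code.

```python
# VITON-HD scores as quoted in the review; lower is better for both metrics.
scores = {
    "LPIPS": {"One Model for All": 22.50, "TEMU-VTOFF": 28.44},
    "DISTS": {"One Model for All": 19.20, "TEMU-VTOFF": 18.04},
}


def compare(scores):
    """Return, per metric, the method with the lower score and its margin.

    Hypothetical helper for illustration; assumes exactly two methods
    per metric.
    """
    summary = {}
    for metric, by_method in scores.items():
        # Sort methods by score, ascending, so the first entry is the winner.
        ranked = sorted(by_method.items(), key=lambda kv: kv[1])
        (best, best_val), (_, other_val) = ranked[0], ranked[1]
        summary[metric] = (best, round(other_val - best_val, 2))
    return summary


result = compare(scores)
for metric, (winner, margin) in result.items():
    print(f"{metric}: {winner} is lower by {margin}")
```

The two metrics disagree on the winner, which is the reviewer's point: a single headline claim does not capture the VITON-HD results.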

Code & Models

Repositories

davidelobba/TEMU-VTOFF
PyTorch · Official

Taxonomy

Topics: Fashion and Cultural Textiles · Color perception and design