VTBench: Comprehensive Benchmark Suite Towards Real-World Virtual Try-on Models
Hu Xiaobin, Liang Yujie, Luo Donghao, Peng Xu, Zhang Jiangning, Zhu Junwei, Wang Chengjie, Fu Yanwei

TL;DR
VTBench is a comprehensive, multi-dimensional benchmark suite designed to evaluate virtual try-on models in real-world scenarios, addressing current limitations in metrics, test diversity, and perceptual alignment.
Contribution
The paper introduces VTBench, a hierarchical benchmark with tailored test sets and evaluation criteria, emphasizing human perception and real-world complexity for virtual try-on models.
Findings
Model performance varies across dimensions and scenarios.
Performance differs between indoor and real-world try-on scenarios.
Human annotations improve perceptual evaluation accuracy.
Abstract
While virtual try-on has achieved significant progress, evaluating these models towards real-world scenarios remains a challenge. A comprehensive benchmark is essential for three key reasons: (1) Current metrics inadequately reflect human perception, particularly in unpaired try-on settings; (2) Most existing test sets are limited to indoor scenarios, lacking complexity for real-world evaluation; and (3) An ideal system should guide future advancements in virtual try-on generation. To address these needs, we introduce VTBench, a hierarchical benchmark suite that systematically decomposes virtual image try-on into hierarchical, disentangled dimensions, each equipped with tailored test sets and evaluation criteria. VTBench exhibits three key advantages: (1) Multi-Dimensional Evaluation Framework: The benchmark encompasses five critical dimensions for virtual try-on generation (e.g., overall…
Peer Reviews
Decision · Submitted to ICLR 2026
+ This work establishes a foundation for future research toward realistic and perceptually aligned virtual try-on systems.
+ It evaluates 16 state-of-the-art models across multiple paradigms, offering valuable comparative insights for the community.
+ It tailors existing models for virtual try-on evaluation from different perspectives.
- The aesthetic metric correlates poorly with humans, significantly discrediting the reliability of its use in evaluating and comparing VTON models.
- It lacks comparison with commonly used full-reference and no-reference image quality assessment metrics.
- Its claim to guide the development of future VTON models is somewhat overstated, since the benchmark itself doesn’t propose novel generative methods or optimization strategies.
- The paper's most outstanding strength is its large-scale human preference annotation study. The results robustly demonstrate that the proposed new metrics (e.g., for cross-category, texture, hand, and background consistency) are highly correlated with human perceptual judgments, a feature broadly lacking in existing metrics. The VLM-based "Size Fitness" metric and the OCR-based "Font Texture Similarity" (FTS) metric are highly innovative. They elevate evaluation from low-level pixel similar
- **Questionable Efficacy of Visual Texture Metric**: The paper computes the cosine similarity between CLIP or DINO embeddings of the original garment and the cropped garment region from the generated image to judge visual texture. However, CLIP and DINO are trained via contrastive or self-supervised learning, which are not inherently designed to enhance fine-grained details. For example, generative methods like IP-Adapter, which use CLIP as an image encoder, often fail to restore reference ima
1) The motivation of this paper precisely targets the major challenge in virtual try-on evaluation: coarse similarity metrics that are inconsistent with human perception and unsuitable for evaluating texture detail. 2) Font texture similarity is a novel metric that has been overlooked in prior work but is very important in real-world settings, where a brand's text logo must be preserved. 3) The proposed benchmark is valuable to the virtual try-on community.
1) Using a VLM to determine size fitness is not convincing. It is difficult to evaluate whether a generated garment fits the original body shape because of clothing-body occlusion in the clothed model image. In addition, size fitness itself can be decomposed into three categories: oversized/loose fitting, normal fitting, and tight fitting. The authors provide human alignment scores in the experiments, but it would be better to see more evidence that the VLM's size evaluation is accurate. Pe
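The visual texture metric questioned in the reviews above is described as a cosine similarity between CLIP or DINO embeddings of the original garment and of the cropped garment region in the generated image. The following is a minimal sketch of that comparison step only, not the authors' implementation. It assumes the two embeddings have already been extracted by some image encoder; the function name `cosine_similarity` and the example vectors `garment_embedding` and `tryon_crop_embedding` are hypothetical placeholders, not values or names from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical, hand-written embeddings standing in for encoder outputs
# (a real CLIP or DINO embedding has hundreds of dimensions).
garment_embedding = [0.2, 0.7, 0.1, 0.4]
tryon_crop_embedding = [0.25, 0.65, 0.05, 0.5]

score = cosine_similarity(garment_embedding, tryon_crop_embedding)
print(f"texture similarity (cosine): {score:.3f}")
```

A score near 1 means the two embeddings point in almost the same direction. As the reviewer notes, a high score at the embedding level does not by itself guarantee that fine-grained texture details were preserved.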
Taxonomy
Topics: Simulation Techniques and Applications
