Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback
Yuki Hirakawa, Takashi Wada, Ryotaro Shimizu, Takuya Furusawa, Yuki Saito, Ryosuke Araki, Tianwei Chen, Fan Mo, Yoshimitsu Aoki

TL;DR
This paper introduces VTON-IQA, a reference-free, human-aligned image quality assessment framework for virtual try-on systems, supported by a large-scale annotated benchmark, enabling reliable evaluation without ground-truth images.
Contribution
The paper presents VTON-IQA, a reference-free quality assessment framework for virtual try-on, comprising the first large-scale human-annotated benchmark for the task and a novel transformer-based model with cross-attention for perceptual quality prediction.
Findings
VTON-IQA achieves reliable human-aligned quality predictions.
Benchmark evaluation reveals strengths and weaknesses of 14 VTON models.
The dataset contains over 62,000 annotated try-on images.
Abstract
Given a person image and a garment image, image-based Virtual Try-ON (VTON) synthesizes a try-on image of the person wearing the target garment. As VTON systems become increasingly important in practical applications such as fashion e-commerce, reliable evaluation of their outputs has emerged as a critical challenge. In real-world scenarios, ground-truth images of the same person wearing the target garment are typically unavailable, making reference-based evaluation impractical. Moreover, widely used distribution-level metrics such as Fréchet Inception Distance and Kernel Inception Distance measure dataset-level similarity and fail to reflect the perceptual quality of individual generated images. To address these limitations, we propose Image Quality Assessment for Virtual Try-On (VTON-IQA), a reference-free framework for human-aligned, image-level quality assessment without requiring…
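To illustrate why a distribution-level metric says little about any single image, the sketch below computes a Fréchet distance between two sets of features. It is a deliberately simplified, one-dimensional case: the real Fréchet Inception Distance fits multivariate Gaussians to Inception-network features, whereas here each image is represented by one hypothetical scalar feature, and the function name and sample values are assumptions made for illustration only.

```python
import statistics


def frechet_distance_1d(real_feats, gen_feats):
    """Squared Frechet distance between 1-D Gaussians fitted to two sample sets.

    For univariate Gaussians the distance reduces to
    (mu_r - mu_g)^2 + (sigma_r - sigma_g)^2.
    """
    mu_r = statistics.fmean(real_feats)
    mu_g = statistics.fmean(gen_feats)
    # Population standard deviation of each set of features.
    sd_r = statistics.pstdev(real_feats)
    sd_g = statistics.pstdev(gen_feats)
    return (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2


# Hypothetical scalar features, one per image (illustrative values only).
real = [0.0, 1.0, 2.0, 3.0]
gen_a = [3.0, 2.0, 1.0, 0.0]  # same set of values, different image order
gen_b = [1.5, 1.5, 1.5, 1.5]  # all generated images collapse to one value

score_a = frechet_distance_1d(real, gen_a)  # depends only on set statistics
score_b = frechet_distance_1d(real, gen_b)
print(score_a, score_b)
```

Each call returns one number for a whole set of generated images, computed only from the sets' means and spreads. Reordering the generated images leaves the score unchanged, so the value cannot be attributed to any individual try-on result, which is the gap an image-level assessment is meant to fill.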
Taxonomy
Topics: Image and Video Quality Assessment · Visual Attention and Saliency Detection · Generative Adversarial Networks and Image Synthesis
