Evaluating Attribute Confusion in Fashion Text-to-Image Generation

Ziyue Liu; Federico Girella; Yiming Wang; Davide Talon

arXiv:2507.07079·cs.CV·July 10, 2025

Open Access

TL;DR

This paper introduces L-VQAScore, an automatic metric for evaluating attribute accuracy in fashion text-to-image generation, addressing limitations of existing methods by focusing on entity-attribute localization.

Contribution

It presents a novel localized VQA-based evaluation protocol and a new metric, L-VQAScore, for better assessment of attribute-entity alignment in fashion T2I models.

Findings

01. L-VQAScore correlates better with human judgments than existing metrics.

02. The method effectively detects attribute confusion and mislocalization.

03. It outperforms state-of-the-art evaluation methods on a challenging dataset.

Abstract

Despite the rapid advances in Text-to-Image (T2I) generation models, their evaluation remains challenging in domains like fashion, which involve complex compositional generation. Recent automated T2I evaluation methods leverage pre-trained vision-language models to measure cross-modal alignment. However, our preliminary study reveals that they are still limited in assessing rich entity-attribute semantics, facing challenges in attribute confusion, i.e., when attributes are correctly depicted but associated with the wrong entities. To address this, we build on a Visual Question Answering (VQA) localization strategy targeting one single entity at a time across both visual and textual modalities. We propose a localized human evaluation protocol and introduce a novel automatic metric, Localized VQAScore (L-VQAScore), that combines visual localization with VQA probing both correct (reflection) and…
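The idea of probing one entity at a time can be illustrated with a small sketch. This is not the paper's implementation or its scoring formula, which the truncated abstract does not specify. The names `localized_probes`, `ProbeResult`, `vqa_yes_score` and the box format are all hypothetical, and the localizer and VQA model are stand-ins supplied by the caller. For each prompted entity-attribute pair, the sketch asks about the attribute on that entity's own region (a reflection probe) and then on every other entity's region, which is where attribute confusion as described above would show up.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical region format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class ProbeResult:
    entity: str
    attribute: str
    reflection: float  # "yes" score for the attribute on its own entity
    confusion: float   # highest "yes" score for it on any other entity


def localized_probes(
    pairs: Dict[str, str],
    regions: Dict[str, Box],
    vqa_yes_score: Callable[[Box, str], float],
) -> List[ProbeResult]:
    """Probe each prompted attribute on one localized entity at a time.

    `vqa_yes_score` is an assumed callable standing in for a VQA model:
    it takes a region and a yes/no question and returns a score in [0, 1].
    """
    results = []
    for entity, attribute in pairs.items():
        reflection = vqa_yes_score(
            regions[entity], f"Is the {entity} {attribute}?"
        )
        confusion = 0.0
        for other, other_attribute in pairs.items():
            # Skip the entity itself and entities that are supposed to
            # share this attribute, since those are not confusion.
            if other == entity or other_attribute == attribute:
                continue
            score = vqa_yes_score(
                regions[other], f"Is the {other} {attribute}?"
            )
            confusion = max(confusion, score)
        results.append(ProbeResult(entity, attribute, reflection, confusion))
    return results


# Toy usage: the prompt asks for a striped shirt and a plain skirt,
# but the (pretend) image shows the attributes swapped.
depicted = {(0, 0, 50, 50): "plain", (0, 50, 50, 100): "striped"}


def toy_scorer(box: Box, question: str) -> float:
    # Stand-in for a real VQA model: answers from the lookup table above.
    return 1.0 if depicted[box] in question else 0.0


pairs = {"shirt": "striped", "skirt": "plain"}
regions = {"shirt": (0, 0, 50, 50), "skirt": (0, 50, 50, 100)}
results = localized_probes(pairs, regions, toy_scorer)
for result in results:
    print(result)
```

In the toy case both attributes are present somewhere in the image, so a global check could pass, while the per-entity probes report low reflection and high confusion for both pairs. How such probes are combined into a single score is left open here.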

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling