Evaluating Attribute Confusion in Fashion Text-to-Image Generation
Ziyue Liu, Federico Girella, Yiming Wang, Davide Talon

TL;DR
This paper introduces L-VQAScore, an automatic metric for evaluating attribute accuracy in fashion text-to-image generation, addressing limitations of existing methods by focusing on entity-attribute localization.
Contribution
It presents a novel localized VQA-based evaluation protocol and a new metric, L-VQAScore, for better assessment of attribute-entity alignment in fashion T2I models.
Findings
L-VQAScore correlates better with human judgments than existing metrics.
The method effectively detects attribute confusion and mislocalization.
It outperforms state-of-the-art evaluation methods on a challenging dataset.
Abstract
Despite the rapid advances in Text-to-Image (T2I) generation models, their evaluation remains challenging in domains like fashion, which involve complex compositional generation. Recent automated T2I evaluation methods leverage pre-trained vision-language models to measure cross-modal alignment. However, our preliminary study reveals that they are still limited in assessing rich entity-attribute semantics and struggle with attribute confusion, i.e., cases where attributes are correctly depicted but associated with the wrong entities. To address this, we build on a Visual Question Answering (VQA) localization strategy that targets a single entity at a time across both visual and textual modalities. We propose a localized human evaluation protocol and introduce a novel automatic metric, Localized VQAScore (L-VQAScore), that combines visual localization with VQA probing both correct (reflection) and…
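The per-entity probing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and not the L-VQAScore formula, which this summary does not give. It assumes that a prompt has already been parsed into a mapping from entities to attributes, and that some VQA model returns a "yes" probability for a question asked about the image region of one entity. The names `build_probes`, `agreement_rate`, and `make_fake_vqa`, the question template, and the 0.5 threshold are all hypothetical; the fake VQA function stands in for a real localization and VQA pipeline.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical prompt representation: entity name -> attributes assigned to it.
Spec = Dict[str, List[str]]
# Hypothetical VQA interface: (entity whose region is shown, question) -> P("yes").
VqaFn = Callable[[str, str], float]


def build_probes(spec: Spec) -> List[Tuple[str, str, bool]]:
    """Build (entity, attribute, expected_answer) probes, one entity at a time.

    expected_answer is True for attributes the prompt assigns to the entity,
    and False for attributes the prompt assigns only to other entities
    (a "yes" on those would indicate attribute confusion).
    """
    probes: List[Tuple[str, str, bool]] = []
    for entity, attrs in spec.items():
        for attr in attrs:
            probes.append((entity, attr, True))
        for other, other_attrs in spec.items():
            if other == entity:
                continue
            for attr in other_attrs:
                if attr not in attrs:
                    probes.append((entity, attr, False))
    return probes


def agreement_rate(spec: Spec, vqa_yes_prob: VqaFn, threshold: float = 0.5) -> float:
    """Fraction of probes whose thresholded VQA answer matches the expectation.

    This is an illustrative aggregate only, not the paper's metric.
    """
    probes = build_probes(spec)
    if not probes:
        return 0.0
    correct = 0
    for entity, attr, expected in probes:
        question = f"Is the {entity} {attr}?"
        predicted = vqa_yes_prob(entity, question) >= threshold
        correct += int(predicted == expected)
    return correct / len(probes)


def make_fake_vqa(depicted: Spec) -> VqaFn:
    """Stand-in for a real VQA model: answers from a known description of the image."""
    def vqa(entity: str, question: str) -> float:
        return 1.0 if any(a in question for a in depicted.get(entity, [])) else 0.0
    return vqa


# Prompt asks for a striped shirt and a red skirt.
spec = {"shirt": ["striped"], "skirt": ["red"]}
# Image A follows the prompt; image B swaps the attributes between garments.
faithful = make_fake_vqa({"shirt": ["striped"], "skirt": ["red"]})
swapped = make_fake_vqa({"shirt": ["red"], "skirt": ["striped"]})

print(agreement_rate(spec, faithful))
print(agreement_rate(spec, swapped))
```

In the swapped image both attributes are depicted somewhere, so a check that ignores which garment carries which attribute could still rate it highly. Asking about one entity at a time, including attributes that belong to the other entity, is what separates the two images in this toy setup.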
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
