Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective
Xiangru Zhu, Penglei Sun, Yaoxian Song, Yanghua Xiao, Zhixu Li, Chengyu Wang, Jun Huang, Bei Yang, Xiaoxiao Xu

TL;DR
This paper introduces SemVarEffect and SemVarBench, new tools to evaluate how well text-to-image models handle semantic variations caused by word order changes, revealing the importance of cross-modal alignment.
Contribution
It proposes a novel causality-based evaluation metric and benchmark for assessing semantic variation handling in T2I models, highlighting the role of cross-modal alignment.
Findings
CogView-3-Plus and Ideogram 2 achieved top scores
T2I models understand semantic variations in object relations less well than variations in attributes
Cross-modal alignment significantly impacts handling semantic variations
Abstract
Accurate interpretation and visualization of human instructions are crucial for text-to-image (T2I) synthesis. However, current models struggle to capture semantic variations caused by word order changes, and existing evaluations, which rely on indirect metrics such as text-image similarity, fail to assess these challenges reliably. Their focus on frequent word combinations often obscures poor performance on complex or uncommon linguistic patterns. To address these deficiencies, we propose a novel metric called SemVarEffect and a benchmark named SemVarBench, designed to evaluate the causality between semantic variations in inputs and outputs in T2I synthesis. Semantic variations are achieved through two types of linguistic permutations, while avoiding easily predictable literal variations. Experiments reveal that CogView-3-Plus and Ideogram 2 performed best, achieving a score of…
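As a rough illustration of why word-order permutations are hard for evaluations that only track which words co-occur, the sketch below compares two prompts that contain the same words in a different order. The example sentences and the helper names `bag_of_words` and `cosine` are assumptions made here for illustration; they are not taken from SemVarBench, and this is not the SemVarEffect metric.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count word occurrences, discarding word order entirely."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = sum(count * count for count in a.values()) ** 0.5
    norm_b = sum(count * count for count in b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical prompt pair: same words, swapped roles, different meaning.
original = "a dog chasing a cat"
permuted = "a cat chasing a dog"

same_bag = bag_of_words(original) == bag_of_words(permuted)
similarity = cosine(bag_of_words(original), bag_of_words(permuted))

print(f"identical word counts: {same_bag}")
print(f"order-insensitive similarity: {similarity:.3f}")
```

The two prompts describe different scenes, yet an order-insensitive representation treats them as the same input (similarity of approximately 1.0), so any score built on it cannot tell whether the generated image followed the permutation.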
Taxonomy
Topics: Image Retrieval and Classification Techniques · Natural Language Processing Techniques · Handwritten Text Recognition Techniques
