Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective

Xiangru Zhu; Penglei Sun; Yaoxian Song; Yanghua Xiao; Zhixu Li; Chengyu Wang; Jun Huang; Bei Yang; Xiaoxiao Xu

arXiv:2410.10291 · cs.CL · April 18, 2025

TL;DR

This paper introduces SemVarEffect, a metric, and SemVarBench, a benchmark, to evaluate how well text-to-image models handle semantic variations caused by word order changes. The results point to the importance of cross-modal alignment.

Contribution

It proposes a novel causality-based evaluation metric and benchmark for assessing semantic variation handling in T2I models, highlighting the role of cross-modal alignment.

Findings

01

CogView-3-Plus and Ideogram 2 achieved top scores

02

Models understand semantic variations in object relations less well than those in attributes

03

Cross-modal alignment significantly impacts handling semantic variations

Abstract

Accurate interpretation and visualization of human instructions are crucial for text-to-image (T2I) synthesis. However, current models struggle to capture semantic variations caused by word order changes, and existing evaluations, which rely on indirect metrics such as text-image similarity, fail to assess these challenges reliably. Because such evaluations focus on frequent word combinations, they often obscure poor performance on complex or uncommon linguistic patterns. To address these deficiencies, we propose a novel metric called SemVarEffect and a benchmark named SemVarBench, designed to evaluate the causality between semantic variations in inputs and outputs in T2I synthesis. Semantic variations are achieved through two types of linguistic permutations, while avoiding easily predictable literal variations. Experiments reveal that CogView-3-Plus and Ideogram 2 performed the best, achieving a score of…

Code & Models

Repositories

zhuxiangru/semvarbench
PyTorch · Official

Videos

Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective · SlidesLive

Taxonomy

Topics: Image Retrieval and Classification Techniques · Natural Language Processing Techniques · Handwritten Text Recognition Techniques

Methods: Focus