Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis

Taihang Hu; Linxuan Li; Joost van de Weijer; Hongcheng Gao; Fahad Shahbaz Khan; Jian Yang; Ming-Ming Cheng; Kai Wang; Yaxing Wang

arXiv:2411.07132·cs.CV·November 12, 2024

TL;DR

This paper introduces Token Merging (ToMe), a novel method that improves semantic binding in text-to-image synthesis by aggregating tokens, addressing complex object-attribute relationships without extensive fine-tuning.

Contribution

The paper proposes Token Merging (ToMe), a training-free approach that enhances semantic binding in T2I models by aggregating tokens and using auxiliary losses, outperforming existing methods.

Findings

01 ToMe improves semantic binding accuracy in complex scenarios.

02 It outperforms existing methods on T2I-CompBench and GPT-4o benchmarks.

03 Code will be publicly available for reproducibility.

Abstract

Although text-to-image (T2I) models exhibit remarkable generation capabilities, they frequently fail to accurately bind semantically related objects or attributes in the input prompts, a challenge termed semantic binding. Previous approaches either involve intensive fine-tuning of the entire T2I model or require users or large language models to specify generation layouts, adding complexity. In this paper, we define semantic binding as the task of associating a given object with its attribute, termed attribute binding, or linking it to other related sub-objects, referred to as object binding. We introduce a novel method called Token Merging (ToMe), which enhances semantic binding by aggregating relevant tokens into a single composite token. This ensures that the object, its attributes and sub-objects all share the same cross-attention map. Additionally, to address potential confusion…
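The aggregation step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function name `merge_tokens`, the use of an element-wise sum as the aggregation rule, and the choice to drop the merged-away positions from the sequence are all assumptions made for illustration. The abstract only states that relevant tokens are aggregated into a single composite token.

```python
from typing import List, Sequence

Vector = List[float]


def merge_tokens(
    embeddings: Sequence[Vector], group: Sequence[int], anchor: int
) -> List[Vector]:
    """Aggregate the embeddings at `group` into one composite vector.

    Assumptions (not taken from the paper page): aggregation is an
    element-wise sum, the composite replaces the token at `anchor`,
    and the other group members are removed from the sequence.
    """
    if anchor not in group:
        raise ValueError("anchor must be one of the group indices")

    dim = len(embeddings[0])
    composite = [0.0] * dim
    for idx in group:
        for d in range(dim):
            composite[d] += embeddings[idx][d]

    merged: List[Vector] = []
    for idx, vec in enumerate(embeddings):
        if idx == anchor:
            merged.append(composite)   # composite token takes the anchor slot
        elif idx not in group:
            merged.append(list(vec))   # unrelated tokens pass through unchanged
        # remaining group members are absorbed into the composite
    return merged


# Toy prompt "a red car" with made-up 2-dimensional embeddings.
tokens = ["a", "red", "car"]
embeddings = [[0.5, 0.0], [1.0, 2.0], [3.0, 4.0]]

# Merge the attribute "red" (index 1) into the object "car" (index 2).
merged = merge_tokens(embeddings, group=[1, 2], anchor=2)
print(merged)
```

Because "red" and "car" now occupy a single position in the sequence, any cross-attention layer that consumes it produces one attention map for both, which is the property the abstract highlights.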

Code & Models

Repositories

hutaihang/tome (PyTorch, official)

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Handwritten Text Recognition Techniques