UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation

Wonjun Kang; Byeongkeun Ahn; Minjae Lee; Kevin Galim; Seunghyuk Oh; Hyung Il Koo; Nam Ik Cho

arXiv:2508.05399 · cs.CV · August 8, 2025

TL;DR

UNCAGE is a training-free method that enhances compositional accuracy in Masked Generative Transformers for text-to-image generation by using contrastive attention guidance to improve text-image alignment.

Contribution

The paper introduces UNCAGE, a novel training-free, attention-guided approach that improves compositional fidelity in Masked Generative Transformers.

Findings

01. Improves performance on multiple benchmarks

02. Enhances text-image alignment accuracy

03. Negligible inference overhead

Abstract

Text-to-image (T2I) generation has been actively studied using Diffusion Models and Autoregressive Models. Recently, Masked Generative Transformers have gained attention as an alternative to Autoregressive Models to overcome the inherent limitations of causal attention and autoregressive decoding through bidirectional attention and parallel decoding, enabling efficient and high-quality image generation. However, compositional T2I generation remains challenging, as even state-of-the-art Diffusion Models often fail to accurately bind attributes and achieve proper text-image alignment. While Diffusion Models have been extensively studied for this issue, Masked Generative Transformers exhibit similar limitations but have not been explored in this context. To address this, we propose Unmasking with Contrastive Attention Guidance (UNCAGE), a novel training-free method that improves…
