A Practical Investigation of Spatially-Controlled Image Generation with Transformers

Guoxuan Xia; Harleen Hanspal; Petru-Daniel Tudosiu; Shifeng Zhang; Sarah Parisot

arXiv:2507.15724 · cs.CV · November 5, 2025

Open Access

TL;DR

This paper systematically compares transformer-based models for spatially-controlled image generation, proposing a simple baseline, exploring sampling enhancements, and clarifying the role of adapter-based methods to guide future research.

Contribution

The paper provides a comprehensive, controlled experimental analysis of transformer-based spatially-controlled image generation, introducing a simple baseline and clarifying the effectiveness of various techniques.

Findings

01 Control token prefilling is an effective baseline.

02 Extending classifier-free guidance improves control consistency.

03 Adapter-based approaches help mitigate forgetting with limited data.
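
To make the second finding more concrete, the following is a minimal sketch of one common way to extend classifier-free guidance with a separate control term. It is an illustrative assumption, not the formulation used in the paper, which this page does not reproduce. The function name `guided_prediction`, the weights `w_control` and `w_text`, and the three-prediction setup are all hypothetical.

```python
from typing import List, Sequence


def guided_prediction(
    pred_uncond: Sequence[float],
    pred_control: Sequence[float],
    pred_full: Sequence[float],
    w_control: float = 1.5,
    w_text: float = 3.0,
) -> List[float]:
    """Combine three model outputs using two guidance weights.

    Assumed (hypothetical) inputs, all of the same length:
      pred_uncond  -- prediction with neither control nor text/class condition
      pred_control -- prediction with the spatial control only
      pred_full    -- prediction with both control and text/class condition

    The result starts from the unconditional prediction, moves towards the
    control-only prediction scaled by w_control, then towards the fully
    conditioned prediction scaled by w_text.
    """
    if not (len(pred_uncond) == len(pred_control) == len(pred_full)):
        raise ValueError("all predictions must have the same length")
    return [
        u + w_control * (c - u) + w_text * (f - c)
        for u, c, f in zip(pred_uncond, pred_control, pred_full)
    ]
```

With both weights set to 1 the expression reduces to the fully conditioned prediction, and with both set to 0 it reduces to the unconditional one. Raising `w_control` above 1 pushes the output further in the direction implied by the control signal, which is the intuition behind using guidance to improve control consistency.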

Abstract

Enabling image generation models to be spatially controlled is an important area of research, empowering users to better generate images according to their own fine-grained specifications via, e.g., edge maps or poses. Although this task has seen impressive improvements in recent times, a focus on rapidly producing stronger models has come at the cost of detailed and fair scientific comparison. Differing training data, model architectures and generation paradigms make it difficult to disentangle the factors contributing to performance. Meanwhile, the motivations and nuances of certain approaches become lost in the literature. In this work, we aim to provide clear takeaways across generation paradigms for practitioners wishing to develop transformer-based systems for spatially-controlled generation, clarifying the literature and addressing knowledge gaps. We perform controlled experiments on…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Face recognition and analysis