Test-time Controllable Image Generation by Explicit Spatial Constraint Enforcement

Z. Zhang; B. Liu; J. Bao; L. Chen; S. Zhu; J. Yu

arXiv:2501.01368 · cs.CV · January 3, 2025

Open Access

TL;DR

This paper introduces a test-time controllable image generation method that decouples spatial conditions into semantic and geometric components and enforces each one separately, improving layout consistency in generated images without fine-tuning the model.

Contribution

The paper proposes decoupling spatial conditions into semantic and geometric constraints and enforcing each individually during test-time generation, which improves control and generalizability.

Findings

01 Achieves a 30% relative boost in layout consistency over state-of-the-art methods.

02 Handles complex spatial conditions with a diffusion-based latents-refill technique.

03 Demonstrates improved test-time controllability on the COCO-Stuff dataset.

Abstract

Recent text-to-image generation favors various forms of spatial conditions, e.g., masks, bounding boxes, and key points. However, the majority of the prior art requires form-specific annotations to fine-tune the original model, leading to poor test-time generalizability. Meanwhile, existing training-free methods work well only with simplified prompts and spatial conditions. In this work, we propose a novel yet generic test-time controllable generation method that aims at natural text prompts and complex conditions. Specifically, we decouple spatial conditions into semantic and geometric conditions and then enforce their consistency during the image-generation process individually. As for the former, we target bridging the gap between the semantic condition and text prompts, as well as the gap between such condition and the attention map from diffusion models. To achieve this, we propose…
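
As a rough illustration of what "consistency" between a semantic condition and an attention map could mean, the sketch below scores how much of one token's attention mass falls inside a binary region mask. This is not the paper's method, which the truncated abstract does not spell out; the function names `attention_inside_mask` and `layout_loss`, the nested-list representation, and the choice of score are all assumptions made for illustration.

```python
def attention_inside_mask(attn, mask):
    """Fraction of an attention map's total mass that lies inside a binary mask.

    attn: 2D list of non-negative attention weights for one text token (hypothetical input).
    mask: 2D list of 0/1 values with the same shape, marking the target region.
    """
    if len(attn) != len(mask) or any(len(a) != len(m) for a, m in zip(attn, mask)):
        raise ValueError("attn and mask must have the same shape")
    total = sum(sum(row) for row in attn)
    if total <= 0:
        return 0.0  # no attention mass at all, so nothing lies inside the region
    inside = sum(a * m
                 for attn_row, mask_row in zip(attn, mask)
                 for a, m in zip(attn_row, mask_row))
    return inside / total


def layout_loss(attn, mask):
    """Illustrative penalty: 0 when all attention is inside the mask, 1 when none is."""
    return 1.0 - attention_inside_mask(attn, mask)


if __name__ == "__main__":
    # Toy 2x2 attention map; the mask selects the right-hand column.
    attn = [[1.0, 3.0],
            [2.0, 4.0]]
    mask = [[0, 1],
            [0, 1]]
    print(attention_inside_mask(attn, mask))
    print(layout_loss(attn, mask))
```

A training-free approach could, in principle, use a score like this as a guidance signal at each denoising step, but how the paper actually bridges the gap is left to the full text.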

Taxonomy

Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · Image and Object Detection Techniques

Methods: Softmax · Attention Is All You Need · Diffusion