Agentic Flow Steering and Parallel Rollout Search for Spatially Grounded Text-to-Image Generation

Ping Chen; Daoxuan Zhang; Xiangming Wang; Yungeng Liu; Haijin Zeng; Yongyong Chen

arXiv:2603.18627·cs.AI·March 20, 2026

Open Access

TL;DR

This paper introduces AFS-Search, a training-free, closed-loop framework for spatially grounded text-to-image generation that improves accuracy and speed by dynamically steering and exploring multiple generation trajectories using a vision-language model as a semantic critic.

Contribution

The paper presents a novel training-free, closed-loop search framework with flow steering and parallel rollout, enhancing spatial grounding and semantic accuracy in T2I generation without additional training.

Findings

01. Achieves state-of-the-art results on three benchmarks.

02. Significantly improves on FLUX.1-dev, the base model it is built upon.

03. Offers a faster variant with competitive results.

Abstract

Precise Text-to-Image (T2I) generation has achieved great success but is hindered by the limited relational reasoning of static text encoders and by error accumulation in open-loop sampling. Without real-time feedback, initial semantic ambiguities along the Ordinary Differential Equation (ODE) trajectory inevitably escalate into stochastic deviations from spatial constraints. To bridge this gap, we introduce AFS-Search (Agentic Flow Steering and Parallel Rollout Search), a training-free closed-loop framework built upon FLUX.1-dev. AFS-Search combines parallel rollout search with a flow steering mechanism that leverages a Vision-Language Model (VLM) as a semantic critic to diagnose intermediate latents and dynamically steer the velocity field via precise spatial grounding. Complementarily, we formulate T2I generation as a sequential decision-making process,…
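To make the closed-loop idea in the abstract concrete, the following is a toy sketch only, not the authors' implementation. It assumes 2-D points in place of image latents, a hand-written linear velocity field in place of FLUX.1-dev, and a distance-based score in place of the VLM critic. All names (`base_velocity`, `critic`, `steered_step`, `search`) and all parameter values are hypothetical. It illustrates two ingredients the abstract names: adding a correction to the velocity field during Euler integration of the ODE, and branching several candidate rollouts at each step while keeping the one the critic scores highest.

```python
import math
import random


def base_velocity(x, t):
    """Toy stand-in for a pretrained flow model: pulls toward a fixed endpoint.

    The time t is accepted but unused; a real model would condition on it.
    """
    end = (1.0, 0.0)  # where open-loop sampling would drift (assumed)
    return tuple(e - xi for e, xi in zip(end, x))


def critic(x, target):
    """Toy stand-in for the VLM critic: higher score means closer to target."""
    return -math.dist(x, target)


def steered_step(x, t, dt, target, strength):
    """One Euler step of the ODE with an added steering term toward target."""
    v = base_velocity(x, t)
    v = tuple(vi + strength * (g - xi) for vi, g, xi in zip(v, target, x))
    return tuple(xi + dt * vi for xi, vi in zip(x, v))


def search(target, x0=(0.0, 0.0), steps=20, branches=4,
           strength=2.0, noise=0.05, seed=0):
    """Greedy rollout search: branch, step each candidate, keep the best."""
    rng = random.Random(seed)
    x = x0
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        candidates = []
        for _ in range(branches):
            # Small random perturbation so the branches explore different paths.
            start = tuple(xi + rng.gauss(0.0, noise) for xi in x)
            candidates.append(steered_step(start, t, dt, target, strength))
        # Closed loop: the critic picks which candidate continues.
        x = max(candidates, key=lambda c: critic(c, target))
    return x


if __name__ == "__main__":
    target = (0.0, 1.0)  # hypothetical "spatial constraint"
    steered = search(target)
    open_loop = search(target, branches=1, strength=0.0, noise=0.0)
    print("steered score:  ", critic(steered, target))
    print("open-loop score:", critic(open_loop, target))
```

Setting `strength=0.0`, `branches=1`, and `noise=0.0` recovers plain open-loop sampling in this toy, which gives a baseline to compare the steered search against. In a real system the critic call would be far more expensive than a velocity evaluation, so how often it is invoked is a design choice this sketch does not address.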

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling