VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis

Meng Chu; Senqiao Yang; Haoxuan Che; Suiyun Zhang; Xichen Zhang; Shaozuo Yu; Haokun Gui; Zhefan Rao; Dandan Tu; Rui Liu; Jiaya Jia

arXiv:2512.19243 · cs.CV · February 6, 2026

Open Access · 1 dataset

TL;DR

VisionDirector is a training-free vision-language supervisor for generative image synthesis. It handles long, multi-goal prompts through structured goal extraction, staged editing, and semantic verification, and the authors report state-of-the-art results.

Contribution

The paper introduces VisionDirector, a training-free, goal-oriented supervision method that enhances multi-goal image editing and synthesis, outperforming existing models on complex benchmarks.

Findings

01 Achieves a 7% improvement on GenEval

02 Reduces edit steps from 4.2 to 3.1

03 Enhances consistency in typography and pose editing

Abstract

Generative models can now produce photorealistic imagery, yet they still struggle with the long, multi-goal prompts that professional designers issue. To expose this gap and better evaluate models' performance in real-world settings, we introduce Long Goal Bench (LGBench), a 2,000-task suite (1,000 T2I and 1,000 I2I) whose average instruction contains 18 to 22 tightly coupled goals spanning global layout, local object placement, typography, and logo fidelity. We find that even state-of-the-art models satisfy fewer than 72 percent of the goals and routinely miss localized edits, confirming the brittleness of current pipelines. To address this, we present VisionDirector, a training-free vision-language supervisor that (i) extracts structured goals from long instructions, (ii) dynamically decides between one-shot generation and staged edits, (iii) runs micro-grid sampling with semantic…
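
The supervisor loop outlined in the abstract (extract goals, pick one-shot or staged editing, then verify) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Goal`, `extract_goals`, `choose_plan`, and `refine`, the `category: text; ...` instruction format, the goal-count threshold, and the one-goal-per-round staging policy are all hypothetical choices made for illustration. The generator, editor, and verifier are passed in as plain callables, so no real model API is assumed.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass(frozen=True)
class Goal:
    """One atomic requirement taken from a long instruction (hypothetical type)."""
    category: str  # e.g. "layout", "typography", "logo"
    text: str


def extract_goals(instruction: str) -> List[Goal]:
    """Toy stand-in for goal extraction: parse 'category: text' items split by ';'."""
    goals = []
    for part in instruction.split(";"):
        part = part.strip()
        if not part:
            continue
        category, _, text = part.partition(":")
        if not text.strip():  # no category prefix was given
            category, text = "general", category
        goals.append(Goal(category.strip(), text.strip()))
    return goals


def choose_plan(goals: List[Goal], one_shot_limit: int = 3) -> str:
    """Assumed policy: few goals -> one-shot generation, many goals -> staged edits."""
    return "one_shot" if len(goals) <= one_shot_limit else "staged"


def refine(
    goals: List[Goal],
    generate: Callable[[List[Goal]], Any],
    edit: Callable[[Any, List[Goal]], Any],
    verify: Callable[[Any, Goal], bool],
    max_rounds: int = 4,
) -> Tuple[Any, float, int]:
    """Closed loop: generate, verify each goal, edit the failures, repeat.

    Returns the final image, the fraction of goals satisfied, and the
    number of edit rounds used.
    """
    plan = choose_plan(goals)
    image = generate(goals)
    pending = [g for g in goals if not verify(image, g)]
    rounds = 0
    while pending and rounds < max_rounds:
        # Staged plan fixes one failing goal per round; one-shot retries them together.
        targets = pending[:1] if plan == "staged" else pending
        image = edit(image, targets)
        rounds += 1
        pending = [g for g in goals if not verify(image, g)]
    rate = (len(goals) - len(pending)) / len(goals) if goals else 1.0
    return image, rate, rounds


if __name__ == "__main__":
    # Toy "image": the set of goals it satisfies. Real models would replace these stubs.
    demo_goals = extract_goals("layout: title on top; typography: bold serif; logo: top-left")
    result = refine(
        demo_goals,
        generate=lambda gs: frozenset(gs[:1]),
        edit=lambda img, targets: img | frozenset(targets),
        verify=lambda img, g: g in img,
    )
    print(result[1], result[2])
```

Keeping the verifier separate from the editor means the goal-satisfaction rate is always computed from a fresh check of every goal, so an edit that breaks a previously satisfied goal is caught on the next round.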

Code & Models

Datasets

TruemanV5/LGBench
dataset · 3.3k downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis