Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning
Hanbo Cheng, Limin Lin, Ruo Zhang, Yicheng Pan, Jun Du

TL;DR
The paper introduces CLVR, a framework that couples visual-language planning with pixel-level diffusion generation to improve complex text-to-image generation. It targets three limitations of current multi-step approaches: unverified planning, unstable long-context optimization, and high inference latency.
Contribution
CLVR offers a comprehensive system integrating visual verification, reinforcement learning, and weight merging to enhance reasoning, stability, and efficiency in complex visual generation tasks.
Findings
CLVR outperforms existing open-source baselines on multiple benchmarks.
It approaches the performance of proprietary commercial models.
The framework enables complex visual generation that scales at test time.
Abstract
Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by…
