CountLoop: Training-Free High-Instance Image Generation via Iterative Agent Guidance
Anindya Mondal, Ayan Banerjee, Sauradip Nag, Josep Llados, Xiatian Zhu, Anjan Dutta

TL;DR
COUNTLOOP is a training-free method that uses iterative feedback from vision-language models to generate images with precise object counts and high spatial quality, especially in dense scenes.
Contribution
COUNTLOOP introduces a training-free framework that combines VLM-based scene layout planning with feedback-driven refinement for high-instance image generation.
Findings
Reduces counting error by up to 57% on COCO-Count, T2I-CompBench, and two newly introduced high-instance benchmarks.
Achieves highest or comparable spatial quality scores.
Maintains photorealism in densely occluded scenes.
Abstract
Diffusion models excel at photorealistic synthesis but struggle with precise object counts, especially in high-density settings. We introduce COUNTLOOP, a training-free framework that achieves precise instance control through iterative, structured feedback. Our method alternates between synthesis and evaluation: a VLM-based planner generates structured scene layouts, while a VLM-based critic provides explicit feedback on object counts, spatial arrangements, and visual quality to refine the layout iteratively. Instance-driven attention masking and cumulative attention composition further prevent semantic leakage, ensuring clear object separation even in densely occluded scenes. Evaluations on COCO-Count, T2I-CompBench, and two newly introduced high-instance benchmarks show that COUNTLOOP reduces counting error by up to 57% and achieves the highest or comparable spatial quality scores…
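The alternation between synthesis and evaluation described in the abstract can be pictured as a simple control loop. The sketch below is illustrative only and is not the authors' implementation: every name in it (`Layout`, `Feedback`, `refine_until_count_matches`, and the `toy_*` stand-ins) is hypothetical, the planner and critic are plain Python functions rather than VLM calls, and the "generator" is a stub that always drops one instance so the feedback path is exercised. Attention masking and composition are not modeled here.

```python
import math
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in normalized coordinates


@dataclass
class Layout:
    """Hypothetical structured scene layout: one box per planned instance."""
    boxes: List[Box] = field(default_factory=list)


@dataclass
class Feedback:
    """Hypothetical critic output: the observed count and an accept flag."""
    observed_count: int
    accept: bool


def grid_boxes(n: int) -> List[Box]:
    """Place n boxes on a near-square grid (a stand-in for real layout planning)."""
    cols = max(1, math.ceil(math.sqrt(n)))
    rows = max(1, math.ceil(n / cols))
    w, h = 1.0 / cols, 1.0 / rows
    return [((i % cols) * w, (i // cols) * h,
             (i % cols + 1) * w, (i // cols + 1) * h) for i in range(n)]


def refine_until_count_matches(
    target: int,
    plan: Callable[[int, Optional[Layout], Optional[Feedback]], Layout],
    synthesize: Callable[[Layout], Any],
    critique: Callable[[Any, int], Feedback],
    max_rounds: int = 5,
) -> Tuple[Any, Layout, int]:
    """Alternate synthesis and evaluation until the critic accepts or rounds run out."""
    layout = plan(target, None, None)          # initial layout from the planner
    image = None
    for round_idx in range(1, max_rounds + 1):
        image = synthesize(layout)             # generation conditioned on the layout
        feedback = critique(image, target)     # critic inspects the result
        if feedback.accept:
            return image, layout, round_idx
        layout = plan(target, layout, feedback)  # planner revises using the feedback
    return image, layout, max_rounds


# --- Toy stand-ins (assumptions for demonstration, not real models) ---

def toy_plan(target: int, prev: Optional[Layout], fb: Optional[Feedback]) -> Layout:
    if prev is None or fb is None:
        return Layout(grid_boxes(target))
    # Compensate for the shortfall (or surplus) the critic reported.
    corrected = len(prev.boxes) + (target - fb.observed_count)
    return Layout(grid_boxes(max(0, corrected)))


def toy_synthesize(layout: Layout) -> dict:
    # Stub generator that always loses one planned instance.
    return {"count": max(0, len(layout.boxes) - 1)}


def toy_critique(image: dict, target: int) -> Feedback:
    return Feedback(observed_count=image["count"], accept=image["count"] == target)


image, layout, rounds = refine_until_count_matches(
    12, toy_plan, toy_synthesize, toy_critique
)
print(f"accepted after {rounds} round(s); planned boxes: {len(layout.boxes)}")
```

In this toy setup the critic only checks the count; the abstract indicates that the actual critic also comments on spatial arrangement and visual quality, which would correspond to extra fields on `Feedback` and richer revision logic in the planner.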
