CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation
V. Kovalev, A. Kuvshinov, A. Buzovkin, D. Pokidov, D. Timonin

TL;DR
CRAFT introduces a training-free, structured inference framework that enhances multimodal text-to-image generation by iteratively verifying and refining images through explicit constraints, improving quality and control without retraining.
Contribution
The paper presents CRAFT, a novel, model-agnostic method for inference-time refinement using explicit constraints, improving image quality and interpretability in multimodal generation.
Findings
Consistently improves compositional accuracy and text rendering.
Achieves strong gains for lightweight models.
Incurs negligible inference overhead.
Abstract
Recent work has shown that inference-time reasoning and reflection can improve text-to-image generation without retraining. However, existing approaches often rely on implicit, holistic critiques or unconstrained prompt rewrites, making their behavior difficult to interpret, control, or stop reliably. In contrast, large language models have benefited from explicit, structured forms of **thinking** based on verification, targeted correction, and early stopping. We introduce CRAFT (Continuous Reasoning and Agentic Feedback Tuning), a training-free and model-agnostic framework for multimodal image generation. CRAFT transforms a user prompt into a set of explicit, dependency-structured visual constraints, verifies generated images using a vision-language model, and performs targeted prompt updates only when specific constraints are violated. This iterative process includes an explicit…
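The loop described in the abstract (decompose the prompt into constraints, verify the image, update the prompt only for violated constraints, stop early once everything passes) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Constraint`, `refine`, `generate`, `verify`, and `max_iters` are hypothetical, and the image generator and vision-language verifier are replaced by toy string-based stand-ins so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Constraint:
    """One explicit visual constraint (hypothetical structure)."""
    name: str
    description: str
    depends_on: List[str] = field(default_factory=list)


def refine(
    prompt: str,
    constraints: List[Constraint],
    generate: Callable[[str], str],
    verify: Callable[[str, List[Constraint]], Dict[str, bool]],
    max_iters: int = 3,
) -> Tuple[str, str, List[List[str]]]:
    """Generate, verify, and patch the prompt until all constraints pass."""
    history: List[List[str]] = []
    image = ""
    for _ in range(max_iters):
        image = generate(prompt)
        results = verify(image, constraints)
        # Only act on a failed constraint once its dependencies hold,
        # so fixes are applied in dependency order.
        violated = [
            c for c in constraints
            if not results[c.name] and all(results[d] for d in c.depends_on)
        ]
        history.append([c.name for c in violated])
        # Early stopping: nothing left to fix.
        if all(results[c.name] for c in constraints):
            break
        # Targeted update: mention only the violated constraints.
        prompt += " " + " ".join(f"Ensure: {c.description}." for c in violated)
    return image, prompt, history


# Toy stand-ins (assumptions): the "image" is just the prompt text, and the
# "verifier" checks whether each constraint description appears in it.
def toy_generate(prompt: str) -> str:
    return prompt


def toy_verify(image: str, constraints: List[Constraint]) -> Dict[str, bool]:
    return {c.name: c.description in image for c in constraints}


constraints = [
    Constraint("cat", "a cat is present"),
    Constraint("hat", "the cat wears a red hat", depends_on=["cat"]),
]
image, final_prompt, history = refine(
    "a photo", constraints, toy_generate, toy_verify
)
print(history)
```

In a real setting, `generate` would wrap a text-to-image model and `verify` would query a vision-language model per constraint; the control flow is the only part this sketch is meant to convey.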
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Historical Architecture and Urbanism
