CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation

V. Kovalev; A. Kuvshinov; A. Buzovkin; D. Pokidov; D. Timonin

arXiv:2512.20362 · cs.CV · January 22, 2026

Open Access

TL;DR

CRAFT introduces a training-free, structured inference framework that enhances multimodal text-to-image generation by iteratively verifying and refining images through explicit constraints, improving quality and control without retraining.

Contribution

The paper presents CRAFT, a training-free, model-agnostic method for inference-time refinement that uses explicit constraints to improve image quality and interpretability in multimodal generation.

Findings

01

Consistently improves compositional accuracy and text rendering.

02

Achieves strong gains for lightweight models.

03

Incurs negligible inference overhead.

Abstract

Recent work has shown that inference-time reasoning and reflection can improve text-to-image generation without retraining. However, existing approaches often rely on implicit, holistic critiques or unconstrained prompt rewrites, making their behavior difficult to interpret, control, or stop reliably. In contrast, large language models have benefited from explicit, structured forms of **thinking** based on verification, targeted correction, and early stopping. We introduce CRAFT (Continuous Reasoning and Agentic Feedback Tuning), a training-free and model-agnostic framework for multimodal image generation. CRAFT transforms a user prompt into a set of explicit, dependency-structured visual constraints, verifies generated images using a vision-language model, and performs targeted prompt updates only when specific constraints are violated. This iterative process includes an explicit…
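
The loop the abstract describes (derive explicit constraints, verify the generated image, update the prompt only for violated constraints) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Constraint` schema, the `refine` function, the `generate` and `verify` callables, the `max_rounds` budget, and the "Ensure:" prompt patch are all hypothetical names and choices introduced here. The stopping rule (stop once every constraint passes, or when the budget runs out) is likewise an assumption, since the abstract is truncated before it describes that step.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Constraint:
    """One explicit visual requirement derived from the prompt (hypothetical schema)."""
    name: str
    description: str
    depends_on: List[str] = field(default_factory=list)


def refine(
    prompt: str,
    constraints: List[Constraint],              # assumed listed in dependency order
    generate: Callable[[str], Any],             # stand-in for any text-to-image model
    verify: Callable[[Any, Constraint], bool],  # stand-in for a vision-language check
    max_rounds: int = 3,                        # assumed iteration budget
) -> Tuple[Any, str, Dict[str, bool], int]:
    """Generate, verify each constraint, and patch the prompt only for violations."""
    image: Any = None
    status: Dict[str, bool] = {}
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        image = generate(prompt)
        status = {}
        violated: List[Constraint] = []
        for c in constraints:
            # A constraint is only checked once all of its prerequisites hold.
            if not all(status.get(dep, False) for dep in c.depends_on):
                status[c.name] = False
                continue
            status[c.name] = verify(image, c)
            if not status[c.name]:
                violated.append(c)
        # Assumed stopping rule: everything passes, or the budget is spent.
        if all(status.values()) or rounds == max_rounds:
            break
        # Targeted update: mention only the constraints that were checked and failed.
        fixes = "; ".join(c.description for c in violated)
        prompt = f"{prompt}. Ensure: {fixes}"
    return image, prompt, status, rounds
```

Passing the generator and verifier in as callables keeps the sketch model-agnostic in the sense the abstract uses: any text-to-image backend and any vision-language checker could be plugged in without changing the loop. Skipping dependent constraints until their prerequisites pass keeps each prompt patch focused on the earliest failure.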

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Historical Architecture and Urbanism