M3: High-fidelity Text-to-Image Generation via Multi-Modal, Multi-Agent and Multi-Round Visual Reasoning

Bangji Yang; Ruihan Guo; Jiajun Fan; Chaoran Cheng; Ge Liu

arXiv:2602.06166 · cs.CV · February 9, 2026

TL;DR

M3 is a training-free multi-agent framework that iteratively refines text-to-image generation, significantly improving compositional accuracy and surpassing state-of-the-art commercial models on challenging benchmarks.

Contribution

Introduces M3, a novel multi-agent, multi-round inference framework that enhances open-source text-to-image models without retraining, achieving state-of-the-art compositional generation performance.

Findings

01 Outperforms commercial models on OneIG-EN benchmark

02 Doubles spatial reasoning performance on hard test sets

03 Enhances open-source models with no retraining needed

Abstract

Generative models have achieved impressive fidelity in text-to-image synthesis, yet struggle with complex compositional prompts involving multiple constraints. We introduce M3 (Multi-Modal, Multi-Agent, Multi-Round), a training-free framework that systematically resolves these failures through iterative inference-time refinement. M3 orchestrates off-the-shelf foundation models in a robust multi-agent loop: a Planner decomposes prompts into verifiable checklists, while specialized Checker, Refiner, and Editor agents surgically correct constraints one at a time, with a Verifier ensuring monotonic improvement. Applied to open-source models, M3 achieves remarkable results on the challenging OneIG-EN benchmark, with our Qwen-Image+M3 surpassing commercial flagship systems including Imagen4 (0.515) and Seedream 3.0 (0.530), reaching state-of-the-art performance (0.532 overall). This…
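The control flow described in the abstract (plan a checklist, check it, fix one constraint at a time, keep an edit only if it improves the result) can be illustrated with a toy sketch. This is not the authors' implementation: the function names (plan, check, score, refine_loop), the comma-splitting planner, the integer score, and the representation of an "image" as the set of constraints it satisfies are all assumptions made purely for illustration. In the actual framework these roles would be played by foundation models operating on real images.

```python
from typing import Callable, FrozenSet, List

# Toy stand-in (assumption): an "image" is the set of constraints it satisfies.
Image = FrozenSet[str]


def plan(prompt: str) -> List[str]:
    """Hypothetical Planner: split a prompt into checklist items on commas."""
    return [part.strip() for part in prompt.split(",") if part.strip()]


def check(image: Image, checklist: List[str]) -> List[str]:
    """Hypothetical Checker: return the checklist items the image fails."""
    return [item for item in checklist if item not in image]


def score(image: Image, checklist: List[str]) -> int:
    """Number of checklist items the image satisfies."""
    return sum(1 for item in checklist if item in image)


def refine_loop(
    prompt: str,
    image: Image,
    edit: Callable[[Image, str], Image],
    max_rounds: int = 5,
) -> Image:
    """Multi-round refinement: fix one failed constraint at a time.

    `edit` plays the combined Refiner + Editor role (hypothetical signature).
    The Verifier step accepts a candidate only if its score is strictly
    higher, so the accepted image never gets worse between rounds.
    """
    checklist = plan(prompt)
    for _ in range(max_rounds):
        failed = check(image, checklist)
        if not failed:
            break  # every constraint is satisfied
        for item in failed:
            candidate = edit(image, item)
            if score(candidate, checklist) > score(image, checklist):
                image = candidate  # Verifier accepts the edit
    return image


if __name__ == "__main__":
    # A well-behaved editor that adds the requested constraint.
    good_edit = lambda img, item: img | {item}
    start = frozenset({"red cube"})
    final = refine_loop("red cube, blue sphere, cube left of sphere", start, good_edit)
    print(sorted(final))
```

The acceptance test is the design point worth noting: an editor that fixes one constraint while breaking others produces a candidate with a lower or equal score, which the loop discards, so the current image is only ever replaced by a better one.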

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Machine Learning in Materials Science