GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning
Kaixun Jiang, Yuzheng Wang, Junjie Zhou, Pandeng Li, Zhihang Liu, Chen-Wei Xie, Zhaoyu Chen, Yun Zheng, Wenqiang Zhang

TL;DR
GenAgent introduces an agentic multimodal framework that decouples understanding and generation, enabling autonomous multi-turn reasoning and tool invocation to improve text-to-image generation performance.
Contribution
GenAgent presents an agentic approach that pairs a multimodal understanding model with image generation models invoked as tools, and trains the agent with reinforcement learning, enabling dynamic multi-turn interactions and improved image quality.
Findings
Achieves +23.6% on GenEval++ and +14% on WISE benchmarks.
Demonstrates cross-tool generalization and task-adaptive reasoning.
Enables test-time scaling with consistent multi-turn improvements.
Abstract
We introduce GenAgent, unifying visual understanding and generation through an agentic multimodal model. Unlike unified models that face expensive training costs and understanding-generation trade-offs, GenAgent decouples these capabilities through an agentic framework: understanding is handled by the multimodal model itself, while generation is achieved by treating image generation models as invokable tools. Crucially, unlike existing modular systems constrained by static pipelines, this design enables autonomous multi-turn interactions where the agent generates multimodal chains-of-thought encompassing reasoning, tool invocation, judgment, and reflection to iteratively refine outputs. We employ a two-stage training strategy: first, cold-start with supervised fine-tuning on high-quality tool invocation and reflection data to bootstrap agent behaviors; second, end-to-end agentic…
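The multi-turn cycle described in the abstract (reasoning, tool invocation, judgment, reflection) can be illustrated with a minimal sketch. This is not the authors' implementation: `run_agent`, `Turn`, and the three toy callables are hypothetical names introduced here, and the sketch splits judgment and reflection into separate functions for readability, whereas the abstract describes a single multimodal model producing all of these steps. The image generator is stood in for by a function that returns a text description.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Turn:
    """One round of the loop: the prompt sent to the tool and the verdict on its output."""
    prompt: str
    image: str
    accepted: bool
    critique: str


def run_agent(
    request: str,
    generate: Callable[[str], str],
    judge: Callable[[str, str], Tuple[bool, str]],
    reflect: Callable[[str, str, str], str],
    max_turns: int = 3,
) -> List[Turn]:
    """Iteratively call a generation tool until the judge accepts or turns run out."""
    history: List[Turn] = []
    prompt = request
    for _ in range(max_turns):
        image = generate(prompt)                    # tool invocation
        accepted, critique = judge(request, image)  # judgment against the original request
        history.append(Turn(prompt, image, accepted, critique))
        if accepted:
            break
        prompt = reflect(request, prompt, critique)  # reflection: revise the tool prompt
    return history


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_generate(prompt: str) -> str:
    return f"<image: {prompt.lower()}>"


def toy_judge(request: str, image: str) -> Tuple[bool, str]:
    if "exactly two cats" in image:
        return True, ""
    return False, "count is ambiguous; state 'exactly two cats'"


def toy_reflect(request: str, prompt: str, critique: str) -> str:
    return prompt.replace("two cats", "exactly two cats")


if __name__ == "__main__":
    for turn in run_agent("two cats on a sofa", toy_generate, toy_judge, toy_reflect):
        print(turn)
```

Raising `max_turns` in this sketch gives the loop more chances to refine its output, which is the general shape of the test-time scaling the findings mention; the actual gains reported in the paper come from the trained agent, not from a loop like this one.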
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
