Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

Shuang Chen; Quanxin Shou; Hangting Chen; Yucheng Zhou; Kaituo Feng; Wenbo Hu; Yi-Fan Zhang; Yunlong Lin; Wenxuan Huang; Mingyang Song; Dasen Dai; Bolin Jiang; Manyuan Zhang; Shi-Xue Zhang; Zhengkai Jiang; Lucas Wang; Zhao Zhong; Yu Cheng; Nanyun Peng

arXiv:2603.29620·cs.CV·April 2, 2026

TL;DR

Unify-Agent is an agentic framework for world-grounded image synthesis that combines reasoning, evidence search, and generation to handle long-tail, knowledge-intensive concepts.

Contribution

The paper presents a unified multimodal agent architecture, a tailored training pipeline with 143K agent trajectories, and FactIP, a new benchmark for evaluating world-knowledge grounding in image synthesis.

Findings

01. Significant improvements over base models across diverse benchmarks.

02. Approaches the world-knowledge capabilities of top closed-source models.

03. Demonstrates the effectiveness of agentic modeling for image synthesis.

Abstract

Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling…
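The four stages named in the abstract (prompt understanding, multimodal evidence searching, grounded recaptioning, final synthesis) can be pictured as a simple sequential loop. The sketch below is only an illustration of that control flow under stated assumptions: `Evidence`, `Trajectory`, `run_pipeline`, and the `toy_*` stage functions are hypothetical names invented here, and the stubs return strings rather than calling any real model or search backend. It does not reflect the authors' actual implementation or interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    """One retrieved item (hypothetical schema): kind is "text" or "image"."""
    kind: str
    content: str


@dataclass
class Trajectory:
    """Record of one pass through the four stages named in the abstract."""
    prompt: str
    queries: List[str] = field(default_factory=list)
    evidence: List[Evidence] = field(default_factory=list)
    grounded_caption: str = ""
    image: str = ""


def run_pipeline(
    prompt: str,
    understand: Callable[[str], List[str]],
    search: Callable[[str], List[Evidence]],
    recaption: Callable[[str, List[Evidence]], str],
    synthesize: Callable[[str], str],
) -> Trajectory:
    """Run the stages in order and keep every intermediate result."""
    traj = Trajectory(prompt=prompt)
    traj.queries = understand(prompt)            # 1. prompt understanding
    for query in traj.queries:                   # 2. multimodal evidence searching
        traj.evidence.extend(search(query))
    traj.grounded_caption = recaption(prompt, traj.evidence)  # 3. grounded recaptioning
    traj.image = synthesize(traj.grounded_caption)            # 4. final synthesis
    return traj


# Toy stand-ins (assumptions for illustration only; no model or web access).
def toy_understand(prompt: str) -> List[str]:
    return [prompt]


def toy_search(query: str) -> List[Evidence]:
    return [
        Evidence("text", f"fact about {query}"),
        Evidence("image", f"ref-image:{query}"),
    ]


def toy_recaption(prompt: str, evidence: List[Evidence]) -> str:
    facts = "; ".join(e.content for e in evidence if e.kind == "text")
    return f"{prompt} ({facts})"


def toy_synthesize(caption: str) -> str:
    return f"<image for: {caption}>"


if __name__ == "__main__":
    result = run_pipeline(
        "a rare orchid", toy_understand, toy_search, toy_recaption, toy_synthesize
    )
    print(result.grounded_caption)
    print(result.image)
```

Keeping every intermediate result in one record is a design choice made here for readability; it loosely mirrors the idea of an agent trajectory, but the format of the 143K curated trajectories is not described in this summary.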

Code & Models

Repositories

shawn0728/Unify-Agent
github

Models

csfufu/Unify-Agent
Hugging Face model · 12 downloads · 1 like

Datasets
