JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

Lin Song; Wenbo Li; Guoqing Ma; Wei Tang; Bo Wang; Yuan Zhang; Yijun Yang; Yicheng Xiao; Jianhui Liu; Yanbing Zhang; Guohui Zhang; Wenhu Zhang; Hang Xu; Nan Jiang; Xin Han; Haoze Sun; Maoquan Zhang; Haoyang Huang; Nan Duan

arXiv:2605.04128 · cs.GR · May 21, 2026

TL;DR

JoyAI-Image is a unified multimodal model that integrates visual understanding, text-to-image generation, and image editing, enhancing spatial reasoning and controllable visual synthesis.

Contribution

It introduces a novel architecture combining a spatially enhanced MLLM with MMDiT, along with a scalable training recipe for broad multimodal capabilities.

Findings

01. Achieves state-of-the-art or highly competitive performance across understanding, generation, long-text rendering, and editing benchmarks.

02. Enhances geometry-aware reasoning and spatial editing abilities.

03. Demonstrates strong spatial intelligence beyond general visual competence.

Abstract

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance.

Code & Models

Repositories

jd-opensource/JoyAI-Image (GitHub)
