JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang, Yuan Zhang, Yijun Yang, Yicheng Xiao, Jianhui Liu, Yanbing Zhang, Guohui Zhang, Wenhu Zhang, Hang Xu, Nan Jiang, Xin Han, Haoze Sun, Maoquan Zhang, Haoyang Huang, Nan Duan

TL;DR
JoyAI-Image is a unified multimodal model that integrates visual understanding, text-to-image generation, and image editing, enhancing spatial reasoning and controllable visual synthesis.
Contribution
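Introduces an architecture that couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), together with a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and general and spatial editing signals.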
Findings
Achieves state-of-the-art or highly competitive performance across understanding, generation, long-text rendering, and editing benchmarks.
Enhances geometry-aware reasoning and spatial editing abilities.
Demonstrates strong spatial intelligence beyond general visual competence.
Abstract
We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance.
