Emu3.5: Native Multimodal Models are World Learners

Yufeng Cui; Honghao Chen; Haoge Deng; Xu Huang; Xinghang Li; Jirong Liu; Yang Liu; Zhuoyan Luo; Jinsheng Wang; Wenxuan Wang; Yueze Wang; Chengyuan Wang; Fan Zhang; Yingli Zhao; Ting Pan; Xianduo Li; Zecheng Hao; Wenxuan Ma; Zhuo Chen; Yulong Ao; Tiejun Huang; Zhongyuan Wang; Xinlong Wang

arXiv:2510.26583 · cs.CV · October 31, 2025

TL;DR

Emu3.5 is a large-scale multimodal world model pre-trained with next-token prediction on over 10 trillion tokens of interleaved vision-language data. It handles interleaved generation, multimodal reasoning, and world modeling, speeds up per-image inference by about 20x with DiDA, and is available as open source.

Contribution

The paper introduces Emu3.5, a unified multimodal model pre-trained end-to-end with a single next-token prediction objective on interleaved vision-language data, post-trained with large-scale reinforcement learning, and accelerated at inference time by Discrete Diffusion Adaptation (DiDA).

Findings

01. Achieves strong multimodal generation and reasoning performance.

02. Accelerates per-image inference by about 20x with DiDA, without sacrificing performance.

03. Demonstrates generalizable world-modeling and open-world manipulation.

Abstract

We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities,…
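
The unified next-token prediction objective mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the shared vocabulary layout (ids 0-3 as text tokens, ids 4-7 as visual tokens), the example sequence, and the function name `next_token_loss` are all assumptions made for illustration. The point is only that one cross-entropy loss is applied to every position of an interleaved sequence, regardless of which modality the target token belongs to.

```python
import math

def next_token_loss(logits, tokens):
    """Mean cross-entropy of predicting tokens[t + 1] from logits[t].

    logits: list of per-position score lists over the shared vocabulary.
    tokens: list of token ids (text and visual ids share one vocabulary).
    """
    total = 0.0
    for t in range(len(tokens) - 1):
        row = logits[t]
        m = max(row)
        # Numerically stable log-sum-exp over the vocabulary.
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[tokens[t + 1]]
    return total / (len(tokens) - 1)

# Hypothetical shared vocabulary: ids 0-3 are text, ids 4-7 are visual tokens.
VOCAB_SIZE = 8
interleaved = [0, 2, 5, 7, 4, 1]  # text, text, image, image, image, text

# Stand-in for model outputs: uniform scores at every position.
uniform_logits = [[0.0] * VOCAB_SIZE for _ in interleaved]
loss = next_token_loss(uniform_logits, interleaved)
print(loss)
```

With uniform scores the loss equals the log of the vocabulary size, which is a convenient sanity check before plugging in real model outputs.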
