Pelican-Unify 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

Yi Zhang; Yinda Chen; Che Liu; Zeyuan Ding; Jin Xu; Shilong Zou; Junwei Liao; Jiayu Hu; Xiancong Ren; Xiaopeng Zhang; Yechi Liu; Haoyuan Shi; Zecong Tang; Haosong Sun; Renwen Cui; Kuishu Wu; Wenhai Liu; Yang Xu; Yingji Zhang; Yidong Wang; Senkang Hu; Jinpeng Lu; Nga Teng Chan; Yechen Wu; Zeting Liu; Xianzhou Hou; Yong Dai; Jian Tang; Xiaozhu Ju

arXiv:2605.15153 · cs.RO · May 22, 2026

TL;DR

Pelican-Unify 1.0 is an embodied foundation model that unifies understanding, reasoning, imagination, and action in a single system. It reports the best score among comparable models on eight VLM benchmarks, first place on WorldArena, and the second-best score among action methods on RoboTwin.

Contribution

It introduces the first unified embodied foundation model trained to jointly optimize understanding, reasoning, imagination, and action, avoiding the need for isolated expert systems.

Findings

01. Achieves 64.7 on eight VLM benchmarks, the best among comparable models.

02. Ranks first on WorldArena with 66.03.

03. Scores 93.5 on RoboTwin, second-best among action methods.

Abstract

We present Pelican-Unify 1.0, the first embodied foundation model trained according to the principle of unification. Pelican-Unify 1.0 uses a single VLM as a unified understanding module, mapping scenes, instructions, visual contexts, and action histories into a shared semantic space. The same VLM also serves as a unified reasoning module, autoregressively producing task-, action-, and future-oriented chains of thought in a single forward pass and projecting the final hidden state into a dense latent variable. A Unified Future Generator (UFG) then conditions on this latent variable and jointly generates future videos and future actions through two modality-specific output heads within the same denoising process. The language, video, and action losses are all backpropagated into the shared representation, enabling the model to jointly optimize understanding, reasoning, imagination, and…
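
The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every class name, function name, dimension, and loss weight below is hypothetical. Random linear maps stand in for the VLM and the Unified Future Generator, the language loss is passed in as a placeholder scalar, and a noise-prediction objective is assumed for the denoising step. The sketch only shows one shared latent conditioning two output heads within the same denoising step, with the three losses summed into a single objective.

```python
import random


def linear(x, w):
    """Multiply vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def init(n, m, rng):
    """Small random n-by-m matrix; a stand-in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(m)] for _ in range(n)]


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


class ToyUnifiedModel:
    """Hypothetical stand-in: one encoder, one shared trunk, two heads."""

    def __init__(self, obs_dim=8, latent_dim=4, video_dim=6, action_dim=3, seed=0):
        rng = random.Random(seed)
        # Stand-in for the VLM that projects its input to a dense latent.
        self.w_vlm = init(obs_dim, latent_dim, rng)
        # Shared denoising trunk that sees the latent and both noisy targets.
        self.w_trunk = init(latent_dim + video_dim + action_dim, latent_dim, rng)
        # Two modality-specific output heads on top of the shared trunk.
        self.w_video = init(latent_dim, video_dim, rng)
        self.w_action = init(latent_dim, action_dim, rng)

    def encode(self, obs):
        return linear(obs, self.w_vlm)

    def denoise_step(self, z, noisy_video, noisy_action):
        # One step: both heads read the same hidden state.
        h = linear(z + noisy_video + noisy_action, self.w_trunk)
        return linear(h, self.w_video), linear(h, self.w_action)


def joint_loss(model, obs, video, action, lang_loss, weights=(1.0, 1.0, 1.0), seed=0):
    """Weighted sum of language, video, and action losses (weights are assumed)."""
    rng = random.Random(seed)
    z = model.encode(obs)
    noise_v = [rng.gauss(0.0, 1.0) for _ in video]
    noise_a = [rng.gauss(0.0, 1.0) for _ in action]
    noisy_v = [v + n for v, n in zip(video, noise_v)]
    noisy_a = [a + n for a, n in zip(action, noise_a)]
    pred_v, pred_a = model.denoise_step(z, noisy_v, noisy_a)
    # Assumed objective: each head predicts the noise added to its modality.
    video_loss = mse(pred_v, noise_v)
    action_loss = mse(pred_a, noise_a)
    w_l, w_v, w_a = weights
    total = w_l * lang_loss + w_v * video_loss + w_a * action_loss
    return total, video_loss, action_loss


model = ToyUnifiedModel()
total, video_loss, action_loss = joint_loss(
    model, obs=[0.5] * 8, video=[0.1] * 6, action=[0.2] * 3, lang_loss=1.5
)
print(total, video_loss, action_loss)
```

Because both heads depend on the latent `z`, the gradient of `total` with respect to the encoder weights would include contributions from all three terms in a real training setup. That is the sense in which the abstract says the language, video, and action losses are backpropagated into the shared representation.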
