HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

Tencent Robotics X; HY Vision Team: Xumin Yu; Zuyan Liu; Ziyi Wang; He Zhang; Yongming Rao; Fangfu Liu; Yani Zhang; Ruowen Zhao; Oran Wang; Yves Liang; Haitao Lin; Minghui Wang; Yubo Dong; Kevin Cheng; Bolin Ni; Rui Huang; Han Hu; Zhengyou Zhang; Linus; Shunyu Yao

arXiv:2604.07430 · cs.CV · April 10, 2026

TL;DR

HY-Embodied-0.5 introduces specialized foundation models for embodied agents, enhancing visual perception and reasoning with a Mixture-of-Transformers architecture and on-policy distillation, validated across numerous benchmarks and real-world robot tasks.

Contribution

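The paper presents a family of models tailored for embodied intelligence: an efficient variant with 2B activated parameters for edge deployment and a larger variant with 32B activated parameters for complex reasoning. Both build on a Mixture-of-Transformers architecture and are trained with on-policy distillation. The code is open-sourced.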

Findings

1. The MoT-2B model outperforms similar-sized state-of-the-art models on 16 benchmarks.
2. The 32B model achieves performance comparable to frontier models like Gemini 3.0 Pro.
3. The models demonstrate effective real-world robot control capabilities.

Abstract

We introduce HY-Embodied-0.5, a family of foundation models specifically designed for real-world embodied agents. To bridge the gap between general Vision-Language Models (VLMs) and the demands of embodied agents, our models are developed to enhance the core capabilities required by embodied intelligence: spatial and temporal visual perception, alongside advanced embodied reasoning for prediction, interaction, and planning. The HY-Embodied-0.5 suite comprises two primary variants: an efficient model with 2B activated parameters designed for edge deployment, and a powerful model with 32B activated parameters targeted for complex reasoning. To support the fine-grained visual perception essential for embodied tasks, we adopt a Mixture-of-Transformers (MoT) architecture to enable modality-specific computing. By incorporating latent tokens, this design effectively enhances the perceptual…
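To make "modality-specific computing" concrete, here is a minimal sketch of the general idea: each token carries a modality tag and is transformed only by the parameters assigned to that modality. This is an illustration under stated assumptions, not the authors' implementation. The class name `ModalityRoutedLayer`, the two modality tags, and the single linear map per modality are all hypothetical, and the sketch leaves out attention, latent tokens, and everything else the truncated abstract does not spell out.

```python
import random


def make_linear(d_in, d_out, rng):
    """Return a d_out x d_in weight matrix with small random entries."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]


def apply_linear(weight, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


class ModalityRoutedLayer:
    """Hypothetical layer: one weight matrix per modality, chosen by token tag."""

    def __init__(self, dim, modalities=("vision", "text"), seed=0):
        rng = random.Random(seed)
        # Assumption: each modality owns a separate set of parameters.
        self.weights = {m: make_linear(dim, dim, rng) for m in modalities}

    def forward(self, tokens, tags):
        # Route every token through the parameters of its own modality.
        return [apply_linear(self.weights[tag], tok) for tok, tag in zip(tokens, tags)]


layer = ModalityRoutedLayer(dim=4)
tokens = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
out = layer.forward(tokens, ["vision", "text"])
```

The two input tokens are identical, yet their outputs differ because each one passes through a different weight matrix. That separation of parameters by modality is the only property this sketch is meant to show.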

Code & Models

Repositories

Tencent-Hunyuan/HY-Embodied (GitHub)
