HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents
Tencent Robotics X, HY Vision Team: Xumin Yu, Zuyan Liu, Ziyi Wang, He Zhang, Yongming Rao, Fangfu Liu, Yani Zhang, Ruowen Zhao, Oran Wang, Yves Liang, Haitao Lin, Minghui Wang, Yubo Dong, Kevin Cheng, Bolin Ni, Rui Huang, Han Hu, Zhengyou Zhang, Linus, Shunyu Yao

TL;DR
HY-Embodied-0.5 introduces specialized foundation models for embodied agents, enhancing visual perception and reasoning with a Mixture-of-Transformers architecture and on-policy distillation, validated across numerous benchmarks and real-world robot tasks.
Contribution
The paper presents a family of models tailored for embodied intelligence, pairing an efficient variant with a large one, built on a Mixture-of-Transformers architecture and an on-policy distillation training paradigm. The code is open-sourced.
Findings
The MoT-2B model outperforms similar-sized state-of-the-art models on 16 benchmarks.
The 32B model achieves performance comparable to frontier models like Gemini 3.0 Pro.
The models demonstrate effective real-world robot control capabilities.
Abstract
We introduce HY-Embodied-0.5, a family of foundation models specifically designed for real-world embodied agents. To bridge the gap between general Vision-Language Models (VLMs) and the demands of embodied agents, our models are developed to enhance the core capabilities required by embodied intelligence: spatial and temporal visual perception, alongside advanced embodied reasoning for prediction, interaction, and planning. The HY-Embodied-0.5 suite comprises two primary variants: an efficient model with 2B activated parameters designed for edge deployment, and a powerful model with 32B activated parameters targeted for complex reasoning. To support the fine-grained visual perception essential for embodied tasks, we adopt a Mixture-of-Transformers (MoT) architecture to enable modality-specific computing. By incorporating latent tokens, this design effectively enhances the perceptual…
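To make "modality-specific computing" concrete, here is a minimal sketch of the general Mixture-of-Transformers idea: every token attends over the whole sequence, but each token is then transformed by weights selected by its modality. This is an illustration only, not the HY-Embodied-0.5 implementation. The function names, the identity query/key/value projections, the single linear map standing in for a feed-forward block, and the toy weights are all assumptions made for brevity, and latent tokens are not modeled.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shared_attention(xs):
    """Global self-attention over all tokens, whatever their modality.

    Assumption: queries, keys and values are the token vectors themselves
    (identity projections) to keep the sketch short.
    """
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(d)])
    return out


def mot_layer(xs, modalities, weights_by_modality):
    """One simplified MoT-style layer.

    Attention mixes information across modalities; the per-token transform
    that follows uses the weight matrix of that token's own modality.
    """
    attended = shared_attention(xs)
    out = []
    for x, h, m in zip(xs, attended, modalities):
        y = matvec(weights_by_modality[m], h)          # modality-specific weights
        out.append([xi + yi for xi, yi in zip(x, y)])  # residual connection
    return out


# Hypothetical toy input: two vision tokens followed by one text token.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
modalities = ["vision", "vision", "text"]
weights_by_modality = {
    "vision": [[2.0, 0.0], [0.0, 2.0]],  # vision tokens get their own transform
    "text": [[0.0, 0.0], [0.0, 0.0]],    # zero weights: text passes through unchanged
}
out = mot_layer(tokens, modalities, weights_by_modality)
```

Because the text weights are zero in this toy setup, the text token leaves the layer unchanged while the vision tokens are updated, which shows that the two modalities really do take separate parameter paths after the shared attention step.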
