# EO-1: An Open Unified Embodied Foundation Model for General Robot Control

**Authors:** Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Dong Wang, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, Maoqing Yao, Haoran Yang, Jiacheng Bao, Bin Zhao, Xuelong Li

arXiv: 2508.21112 · 2026-02-26

## TL;DR

EO-1 is a unified embodied foundation model pre-trained on interleaved vision-text-action data (EO-Data1.5M, over 1.5 million samples). It targets both multimodal embodied reasoning and robot control across multiple embodiments, with the goal of moving closer to human-level flexibility in open-world interaction.

## Contribution

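The paper introduces two things: EO-1, a unified architecture that processes image, text, video, and action inputs in a single model, and EO-Data1.5M, a dataset of over 1.5 million samples for multimodal embodied reasoning. EO-1 is trained by combining auto-regressive decoding with flow matching denoising, and the authors report superior performance in embodied reasoning and robot control.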

## Key findings

- EO-1 is reported to achieve superior performance in multimodal embodied reasoning and robot control.
- The model generalizes across multiple robot embodiments on long-horizon, dexterous manipulation tasks.
- Interleaved vision-text-action training improves open-world understanding and generalization.

## Abstract

The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, we introduce EO-Robotics, which consists of the EO-1 model and the EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 is based on two key pillars: (i) a unified architecture that processes multimodal inputs indiscriminately (image, text, video, and action), and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with an emphasis on interleaved vision-text-action comprehension. EO-1 is trained through synergies between auto-regressive decoding and flow matching denoising on EO-Data1.5M, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated through a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models. Project Page: https://eo-robotics.ai/eo-1.
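The abstract says EO-1 combines auto-regressive decoding with flow matching denoising but gives no formulas in this summary view. The sketch below shows one generic way such a joint objective can be written: a next-token cross-entropy term for text plus a flow matching regression term for continuous actions. It is not the authors' implementation. The function names, the `fm_weight` parameter, the linear noise-to-action path, and the Euler sampler are all assumptions made for illustration.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean next-token cross-entropy. logits: (T, V); targets: (T,) integer ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def flow_matching_pair(actions, noise, t):
    """Assumed linear path: pure noise at t=0, clean actions at t=1.
    Returns the interpolated sample x_t and the regression target (velocity)."""
    x_t = (1.0 - t) * noise + t * actions
    velocity = actions - noise
    return x_t, velocity

def flow_matching_loss(predict_velocity, actions, rng):
    """Mean squared error between predicted and target velocity at a random t."""
    noise = rng.standard_normal(actions.shape)
    t = rng.uniform()
    x_t, target = flow_matching_pair(actions, noise, t)
    return float(np.mean((predict_velocity(x_t, t) - target) ** 2))

def joint_loss(logits, token_targets, predict_velocity, actions, rng, fm_weight=1.0):
    """Hypothetical combined objective: text loss + weighted action loss."""
    text_term = cross_entropy(logits, token_targets)
    action_term = flow_matching_loss(predict_velocity, actions, rng)
    return text_term + fm_weight * action_term

def sample_actions(predict_velocity, shape, rng, steps=10):
    """Generate actions by Euler-integrating the velocity field from t=0 to t=1."""
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * predict_velocity(x, i * dt)
    return x
```

In a real model, `logits` and `predict_velocity` would come from the same shared backbone, which is what makes the objective joint. Here `predict_velocity` is any callable taking `(x_t, t)` and returning an array shaped like the actions, so the sketch can be exercised with a toy function.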

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21112/full.md

## Figures

35 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21112/full.md

---
Source: https://tomesphere.com/paper/2508.21112