XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations

Shichao Fan; Kun Wu; Zhengping Che; Xinhua Wang; Di Wu; Fei Liao; Ning Liu; Yixue Zhang; Zhen Zhao; Zhiyuan Xu; Meng Li; Qingjie Liu; Shanghang Zhang; Min Wan; Jian Tang

arXiv:2511.02776·cs.RO·May 15, 2026

TL;DR

XR-1 introduces a unified vision-motion representation for versatile, scalable vision-language-action learning across diverse robots and tasks, leveraging a novel discrete latent encoding and a three-stage training process.

Contribution

The paper proposes Unified Vision-Motion Codes (UVMC), a discrete latent representation learned with a dual-branch VQ-VAE, which lets the model integrate multi-modal knowledge for robotic VLA tasks.
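
As a rough illustration of what a discrete latent code means here, the following minimal sketch shows the generic vector-quantization step used by VQ-VAE-style models: each continuous latent is replaced by its nearest codebook entry. It is not the authors' implementation. The function name `quantize`, the array sizes, the random "encoder outputs", and the use of one shared codebook for both branches are all assumptions made for illustration; this summary does not say how the two branches are actually wired.

```python
import numpy as np


def quantize(latents, codebook):
    """Generic VQ step: map each latent vector to its nearest codebook entry."""
    # Squared Euclidean distance between every latent (N, D) and code (K, D) -> (N, K)
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)        # discrete code index per latent
    return indices, codebook[indices]     # indices and their quantized vectors


rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))        # hypothetical: 8 codes of dimension 4

# Hypothetical stand-ins for the outputs of a vision branch and a motion branch.
vision_latents = rng.normal(size=(3, 4))
motion_latents = rng.normal(size=(3, 4))

# Assumption: both branches are quantized against the same codebook.
vision_codes, vision_q = quantize(vision_latents, codebook)
motion_codes, motion_q = quantize(motion_latents, codebook)
```

The integer indices are the discrete codes; a downstream policy could consume them or the quantized vectors. Training details such as the commitment loss and straight-through gradient are omitted from this sketch.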

Findings

01. XR-1 outperforms state-of-the-art baselines in real-world experiments.

02. XR-1 generalizes well to new objects and environmental variations.

03. The evaluation covers more than 14,000 rollouts across six robots and 120 tasks.

Abstract

Recent progress in large-scale robotic datasets and vision-language models (VLMs) has advanced research on vision-language-action (VLA) models. However, existing VLA models still face two fundamental challenges: (i) producing precise low-level actions from high-dimensional observations, and (ii) bridging domain gaps across heterogeneous data sources, including diverse robot embodiments and human demonstrations. Existing methods often encode latent variables from either visual dynamics or robotic actions to guide policy learning, but they fail to fully exploit the complementary multi-modal knowledge present in large-scale, heterogeneous datasets. In this work, we present X Robotic Model 1 (XR-1), a novel framework for versatile and scalable VLA learning across diverse robots, tasks, and environments. XR-1 introduces the Unified Vision-Motion Codes (UVMC), a discrete latent…

Code & Models

Repositories

https://xr-1-vla.github.io