AtomVLA: Scalable Post-Training for Robotic Manipulation via Predictive Latent World Models

Xiaoquan Sun; Zetian Xu; Chen Cao; Zonghe Liu; Yihan Sun; Jingrui Pang; Ruijian Zhang; Zhen Yang; Kang Pang; Dingxin He; Mingqi Yuan; Jiayu Chen

arXiv:2603.08519·cs.RO·March 10, 2026

TL;DR

AtomVLA introduces a scalable post-training framework for robotic manipulation that leverages predictive latent world models and subtask decomposition, significantly enhancing long-horizon task robustness and success rates.

Contribution

AtomVLA is presented as the first subtask-aware VLA framework, paired with a scalable offline post-training pipeline that uses latent world models to improve robotic manipulation.

Findings

01 Achieves 97.0% success on the LIBERO benchmark.

02 Maintains robustness under perturbations.

03 Effective in real-world long-horizon tasks.

Abstract

Vision-Language-Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation. The execution of complex multi-step behaviors in VLA models can be improved by robust instruction grounding, a critical component for effective control. However, current paradigms predominantly rely on coarse, high-level task instructions during supervised fine-tuning. This instruction grounding gap leaves models without explicit intermediate guidance, leading to severe compounding errors in long-horizon tasks. Therefore, bridging this instruction gap and providing scalable post-training for VLA models is urgent. To tackle this problem, we propose AtomVLA, the first subtask-aware VLA framework integrated with a scalable offline post-training pipeline. Our framework leverages a large language model to decompose high-level demonstrations into fine-grained atomic subtasks. This…

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics