Universal Pose Pretraining for Generalizable Vision-Language-Action Policies

Haitao Lin; Hanyang Yu; Jingshun Huang; He Zhang; Yonggen Ling; Ping Tan; Xiangyang Xue; Yanwei Fu

arXiv:2602.19710 · cs.CV · May 19, 2026

TL;DR

Pose-VLA introduces a two-stage pretraining framework that decouples spatial grounding and motion alignment, enabling more generalizable vision-language-action policies for robotics.

Contribution

The paper proposes a universal pose pretraining paradigm that separates spatial priors from embodiment-specific actions, improving training efficiency and generalization in VLA models.

Findings

01. Achieves 79.5% success on RoboTwin 2.0

02. Attains 96.0% performance on LIBERO

03. Demonstrates robust real-world generalization with limited demonstrations

Abstract

Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level…
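The abstract mentions discrete pose tokens but, as excerpted here, does not say how they are built. As a rough illustration only, the sketch below shows one generic way a continuous pose vector could be mapped to integer tokens by uniform binning, and mapped back. The function names, the bin count of 256, and the per-dimension ranges are all hypothetical and are not taken from the paper; Pose-VLA's actual tokenization may differ.

```python
from typing import List, Sequence

# Hypothetical number of bins per pose dimension (not from the paper).
NUM_BINS = 256


def pose_to_tokens(
    pose: Sequence[float],
    low: Sequence[float],
    high: Sequence[float],
    num_bins: int = NUM_BINS,
) -> List[int]:
    """Uniformly quantize each pose dimension into an integer bin index."""
    tokens = []
    for value, lo, hi in zip(pose, low, high):
        clipped = min(max(value, lo), hi)      # clamp into the valid range
        frac = (clipped - lo) / (hi - lo)      # normalize to [0, 1]
        # The upper bound maps to num_bins, so cap at the last bin.
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens


def tokens_to_pose(
    tokens: Sequence[int],
    low: Sequence[float],
    high: Sequence[float],
    num_bins: int = NUM_BINS,
) -> List[float]:
    """Recover an approximate pose by taking the center of each bin."""
    return [
        lo + (t + 0.5) * (hi - lo) / num_bins
        for t, lo, hi in zip(tokens, low, high)
    ]


if __name__ == "__main__":
    # Toy 3-dimensional example in an assumed range of [-1, 1] per dimension.
    low, high = [-1.0] * 3, [1.0] * 3
    tokens = pose_to_tokens([0.0, -1.0, 1.0], low, high)
    print(tokens)
    print(tokens_to_pose(tokens, low, high))
```

With this kind of scheme the round-trip error per dimension is bounded by half a bin width, so the bin count trades vocabulary size against pose resolution.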

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robot Manipulation and Learning