PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

Yupeng Zheng; Xiang Li; Songen Gu; Yuhang Zheng; Shuai Tian; Weize Li; Linbo Wang; Senyu Fei; Pengfei Li; Yinfeng Gao; Zebin Xing; Yilun Chen; Qichao Zhang; Haoran Li; Wenchao Ding

arXiv:2604.20834·cs.RO·April 27, 2026

TL;DR

PokeVLA is a lightweight vision-language-action model that enhances robot manipulation by integrating comprehensive world knowledge, spatial awareness, and embodied reasoning through a two-stage training process.

Contribution

PokeVLA introduces a two-stage training paradigm for embodied manipulation: multimodal pre-training of a compact vision-language model, followed by manipulation-specific representation learning.

Findings

1. Achieves state-of-the-art results on the LIBERO-Plus benchmark.

2. Demonstrates robustness and success in real-world robot deployment.

Abstract

Recent advances in Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, yet existing methods exhibit limited efficiency and a lack of high-level knowledge and spatial awareness. To address these challenges, we propose PokeVLA, a lightweight yet powerful foundation model for embodied manipulation that effectively infuses vision-language understanding into action learning. Our framework introduces a two-stage training paradigm: first, we pre-train a compact vision-language model (PokeVLM) on a curated multimodal dataset of 2.4M samples encompassing spatial grounding, affordance, and embodied reasoning tasks; second, we inject manipulation-relevant representations into the action space through multi-view goal-aware semantics learning, geometry alignment, and a novel action expert. Extensive experiments demonstrate state-of-the-art performance on the…
