AnchorVLA4D: an Anchor-Based Spatial-Temporal Vision-Language-Action Model for Robotic Manipulation

Juan Zhu; Zhanying Shao; Xiaoqi Li; Ethan Morgan; Jiadong Xu; Hongwei Fan; Hao Dong

arXiv:2603.12730·cs.RO·March 16, 2026

TL;DR

AnchorVLA4D enhances robotic manipulation by integrating visual anchors and a spatial encoder to improve spatial-temporal reasoning, leading to better object handling and higher success rates without extra sensors.

Contribution

The paper introduces AnchorVLA4D, a novel spatial-temporal VLA model that uses visual anchors and a lightweight encoder to improve spatial awareness in robotic manipulation tasks.

Findings

01

Achieved 13.6% improvement on the Simpler WidowX benchmark.

02

Attained an average success rate of 80% on real-world tasks.

03

Requires no additional sensing modalities, maintaining low inference overhead.

Abstract

Since current Vision-Language-Action (VLA) systems suffer from limited spatial perception and the absence of memory throughout manipulation, we investigate visual anchors as a means to enhance spatial and temporal reasoning within VLA policies for robotic manipulation. Conventional VLAs generate actions by conditioning on a single current frame together with a language instruction. However, since the frame is encoded as a 2D image, it does not contain detailed spatial information, and the VLA similarly lacks any means to incorporate past context. As a result, it frequently forgets objects under occlusion and becomes spatially disoriented during the manipulation process. Thus, we propose AnchorVLA4D, a simple spatial-temporal VLA that augments the visual input with an anchor image to preserve the initial scene context throughout execution, and adds a lightweight spatial encoder that…
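The anchor idea described in the abstract can be illustrated with a minimal sketch: the first frame of an episode is kept fixed and paired with each new frame and the language instruction. This is not the authors' implementation. The names `PolicyInput`, `AnchorBuffer`, and `build_input` are hypothetical, the choice of the first frame as the anchor is an assumption based on the phrase "initial scene context", and the spatial encoder is omitted because the abstract is truncated before describing it.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class PolicyInput:
    """Hypothetical container for what a policy would see at one step."""
    anchor_frame: Any   # frame from the start of the episode, kept fixed
    current_frame: Any  # latest camera frame
    instruction: str    # language instruction


class AnchorBuffer:
    """Keeps the first frame of an episode as a fixed visual anchor.

    Assumption: the anchor is the first frame observed after reset().
    """

    def __init__(self) -> None:
        self._anchor: Any = None
        self._has_anchor = False

    def reset(self) -> None:
        # Call at the start of each episode so a new anchor is captured.
        self._anchor = None
        self._has_anchor = False

    def build_input(self, frame: Any, instruction: str) -> PolicyInput:
        # Store the first frame seen; later calls reuse it unchanged.
        if not self._has_anchor:
            self._anchor = frame
            self._has_anchor = True
        return PolicyInput(self._anchor, frame, instruction)
```

In use, a control loop would call `build_input` once per step and pass the result to the policy. The anchor stays the same across steps while the current frame changes, so the initial scene remains available even when objects later become occluded.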

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Robotic Path Planning Algorithms