ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models

Guoheng Sun; Tingting Du; Kaixi Feng; Chenxiang Luo; Xingguo Ding; Zheyu Shen; Ziyao Wang; Yexiao He; Ang Li

arXiv:2602.17951·cs.CV·February 23, 2026

TL;DR

ROCKET is a residual-oriented multi-layer alignment framework that improves 3D spatial understanding in vision-language-action models, reaching a 98.5% success rate on LIBERO with only 4% of the compute budget.

Contribution

ROCKET proposes a multi-layer alignment method that uses a shared projector to align the residual stream of a VLA backbone with that of a 3D vision foundation model, improving spatial understanding over prior single-layer approaches.

Findings

01. Achieves 98.5% success rate on LIBERO with only 4% of the compute budget.

02. Outperforms prior methods on LIBERO-Plus and RoboTwin datasets.

03. Demonstrates the effectiveness of residual-oriented multi-layer alignment in VLA models.

Abstract

Vision-Language-Action (VLA) models enable instruction-following robotic manipulation, but they are typically pretrained on 2D data and lack 3D spatial understanding. An effective approach is representation alignment, where a strong vision foundation model is used to guide a 2D VLA model. However, existing methods usually apply supervision at only a single layer, failing to fully exploit the rich information distributed across depth; meanwhile, naïve multi-layer alignment can cause gradient interference. We introduce ROCKET, a residual-oriented multi-layer representation alignment framework that formulates multi-layer alignment as aligning one residual stream to another. Concretely, ROCKET employs a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model via a layer-invariant mapping, which reduces gradient conflicts.…
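
The shared-projector idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the cosine-based loss, the one-to-one layer pairing, the linear projector, and all names and dimensions (`cosine_alignment_loss`, `multi_layer_alignment_loss`, `W`, `b`) are assumptions chosen for illustration. The only point it shows is that a single projector, rather than one per layer, maps every selected VLA layer into the teacher's feature space.

```python
import numpy as np

def cosine_alignment_loss(pred, target, eps=1e-8):
    """Mean (1 - cosine similarity) over tokens; close to 0 when directions match."""
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred_n * target_n, axis=-1)))

def multi_layer_alignment_loss(vla_hiddens, teacher_hiddens, W, b):
    """Average alignment loss over paired layers, using ONE shared projector.

    vla_hiddens:     list of (tokens, d_vla) arrays, one per selected VLA layer
    teacher_hiddens: list of (tokens, d_teacher) arrays, one per teacher layer
    W, b:            shared linear projector, shapes (d_vla, d_teacher) and (d_teacher,)
    """
    assert len(vla_hiddens) == len(teacher_hiddens)
    # The same (W, b) is applied at every layer: a layer-invariant mapping.
    losses = [
        cosine_alignment_loss(h @ W + b, t)
        for h, t in zip(vla_hiddens, teacher_hiddens)
    ]
    return sum(losses) / len(losses)

# Toy data (hypothetical sizes): 3 paired layers, 16 tokens each.
rng = np.random.default_rng(0)
tokens, d_vla, d_teacher, n_layers = 16, 32, 24, 3
W = rng.normal(size=(d_vla, d_teacher)) / np.sqrt(d_vla)
b = np.zeros(d_teacher)
vla_hiddens = [rng.normal(size=(tokens, d_vla)) for _ in range(n_layers)]

# Teacher features the shared projector reproduces exactly, versus unrelated ones.
perfect_teacher = [h @ W + b for h in vla_hiddens]
random_teacher = [rng.normal(size=(tokens, d_teacher)) for _ in range(n_layers)]

loss_perfect = multi_layer_alignment_loss(vla_hiddens, perfect_teacher, W, b)
loss_random = multi_layer_alignment_loss(vla_hiddens, random_teacher, W, b)
```

In this toy setup `loss_perfect` is near zero and `loss_random` is larger, since unrelated teacher features cannot be matched. In a real training loop the projector and the VLA backbone would be updated by gradient descent on this term alongside the action loss, which the sketch omits.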

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Advanced Neural Network Applications