Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

Zaijing Li; Bing Hu; Rui Shao; Gongwei Chen; Dongmei Jiang; Pengwei Xie; Jianye Hao; Liqiang Nie

arXiv:2602.20200·cs.RO·May 19, 2026

TL;DR

This paper introduces OptimusVLA, a dual-memory vision-language-action model that enhances robotic manipulation by improving inference efficiency and robustness through global priors and local temporal consistency.

Contribution

It proposes a novel dual-memory framework with global prior and local consistency memories to address efficiency and robustness issues in hierarchical VLA models.

Findings

01 Achieves 98.6% success on the LIBERO benchmark.

02 Improves success rate by 13.5% on CALVIN.

03 Delivers 2.9x inference speedup in real-world tests.

Abstract

Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. They typically comprise a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, their performance is increasingly bottlenecked by the action generation process, in two ways. (i) Low inference efficiency: a pronounced distributional gap between isotropic noise priors and target action distributions increases the number of denoising steps and the incidence of infeasible samples. (ii) Poor robustness: existing policies condition solely on the current observation, neglecting the constraints imposed by the history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM…
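The distributional gap named in point (i) can be made concrete with a toy calculation. The sketch below is not the paper's method and does not model GPM. It assumes a hypothetical 7-dimensional action space, a made-up target action distribution, and a hypothetical "informed" prior centred near that target. It only compares how far a starting sample sits from a target action under each prior, which is a rough proxy for how much work a denoising process would have to do.

```python
import math
import random

# All constants below are illustrative assumptions, not values from the paper.
DIM = 7                    # hypothetical action dimension
TARGET_MEAN = [2.0] * DIM  # hypothetical centre of the target action distribution
TARGET_STD = 0.1           # hypothetical spread of target actions
INFORMED_STD = 0.3         # hypothetical spread of a task-aware prior


def target_action(rng):
    """Draw one action from the toy target distribution."""
    return [rng.gauss(m, TARGET_STD) for m in TARGET_MEAN]


def isotropic_noise(rng):
    """Draw a starting point from a standard isotropic Gaussian N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def informed_prior(rng):
    """Draw a starting point from a prior centred near the target (assumed)."""
    return [rng.gauss(m, INFORMED_STD) for m in TARGET_MEAN]


def mean_distance_to_target(start_sampler, n_samples=2000, seed=0):
    """Average Euclidean distance between starting points and target actions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += math.dist(start_sampler(rng), target_action(rng))
    return total / n_samples


gap_isotropic = mean_distance_to_target(isotropic_noise)
gap_informed = mean_distance_to_target(informed_prior)
print(f"isotropic prior gap: {gap_isotropic:.2f}")
print(f"informed prior gap:  {gap_informed:.2f}")
```

In this toy setting the informed prior starts much closer to the target actions than isotropic noise does, which is the intuition behind wanting a better prior. How OptimusVLA actually constructs its prior is not covered by the text available here.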

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics