ReMem-VLA: Empowering Vision-Language-Action Model with Memory via Dual-Level Recurrent Queries

Hang Li; Fengyi Shen; Dong Chen; Liudi Yang; Xudong Wang; Jinkui Shi; Zhenshan Bing; Ziyuan Liu; Alois Knoll

arXiv:2603.12942 · cs.RO · March 16, 2026

Open Access

TL;DR

ReMem-VLA introduces a dual-level recurrent memory system with learnable queries for vision-language-action models, significantly improving long-term and short-term memory retention in robot control tasks.

Contribution

The paper proposes ReMem-VLA, a memory-augmented VLA model that uses learnable recurrent queries to carry temporal context, without additional training or inference cost.

Findings

1. ReMem-VLA outperforms memory-free baselines on memory-dependent tasks.

2. It demonstrates strong spatial, sequential, episodic, temporal, and visual memory capabilities.

3. The model achieves significant improvements in real-world robot experiments.

Abstract

Vision-language-action (VLA) models for closed-loop robot control are typically cast under the Markov assumption, making them prone to errors on tasks requiring historical context. To incorporate memory, existing VLAs either retrieve from a memory bank, which can be misled by distractors, or extend the frame window, whose fixed horizon still limits long-term retention. In this paper, we introduce ReMem-VLA, a Recurrent Memory VLA model equipped with two sets of learnable queries: frame-level recurrent memory queries for propagating information across consecutive frames to support short-term memory, and chunk-level recurrent memory queries for carrying context across temporal chunks for long-term memory. These queries are trained end-to-end to aggregate and maintain relevant context over time, implicitly guiding the model's decisions without additional training or inference cost.…
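The dual-level recurrence described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the attention-based update rule, the function names (`attend`, `run_episode`), the `chunk_len` parameter, and all dimensions are assumptions chosen only for illustration. The sketch shows one plausible reading of the idea: a small set of frame-level queries is refreshed at every frame, a second set of chunk-level queries is refreshed only at chunk boundaries, and the per-step context stays the same size however long the episode runs.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def attend(queries, tokens):
    """Single-head dot-product attention: each query reads from the tokens."""
    d = len(tokens[0])
    out = []
    for q in queries:
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(d) for t in tokens]
        w = softmax(scores)
        out.append([sum(wj * t[k] for wj, t in zip(w, tokens)) for k in range(d)])
    return out


def run_episode(frames, frame_mem, chunk_mem, chunk_len):
    """Toy dual-level recurrence (hypothetical update rule, not the paper's).

    frames: list of per-frame token lists; each token is a list of floats.
    Returns the final frame-level memory, chunk-level memory, and the
    context length seen at each step.
    """
    context_sizes = []
    for t, frame_tokens in enumerate(frames):
        # Context at each step: current frame plus both memory sets.
        context = frame_tokens + frame_mem + chunk_mem
        context_sizes.append(len(context))
        # Frame-level memory is refreshed at every frame (short-term).
        frame_mem = attend(frame_mem, context)
        # Chunk-level memory is refreshed only at chunk boundaries (long-term).
        if (t + 1) % chunk_len == 0:
            chunk_mem = attend(chunk_mem, frame_tokens + frame_mem + chunk_mem)
    return frame_mem, chunk_mem, context_sizes


random.seed(0)
DIM = 4


def rand_tokens(n):
    return [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(n)]


frames = [rand_tokens(3) for _ in range(6)]  # 6 frames, 3 tokens each
init_frame_mem = rand_tokens(2)              # 2 frame-level queries
init_chunk_mem = rand_tokens(2)              # 2 chunk-level queries

fm, cm, sizes = run_episode(frames, init_frame_mem, init_chunk_mem, chunk_len=3)
```

In this toy version the context length is constant across steps, which is the property that distinguishes recurrent queries from extending the frame window. In the paper the queries are trained end-to-end; here they are random vectors.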

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Robot Manipulation and Learning