MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Hao Shi; Bin Xie; Yingfei Liu; Lin Sun; Fengrong Liu; Tiancai Wang; Erjin Zhou; Haoqiang Fan; Xiangyu Zhang; Gao Huang

arXiv:2508.19236·cs.RO·February 2, 2026

TL;DR

MemoryVLA introduces a cognition-inspired memory framework for vision-language-action models, significantly improving long-horizon robotic manipulation by effectively utilizing perceptual and episodic memory mechanisms.

Contribution

The paper presents MemoryVLA, a novel framework that integrates working memory and long-term episodic memory inspired by human cognition into VLA models for robotic tasks.

Findings

1. Outperforms state-of-the-art baselines on multiple simulation benchmarks.
2. Achieves 84% success rate on real-world long-horizon tasks.
3. Significant improvements in success rates on complex manipulation tasks.

Abstract

Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves verbatim episodic details and semantic gist of past experience for long-term memory. Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics consolidated from it. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with…
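The retrieve, fuse, and consolidate steps named in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract is cut off before it says what the retrieved entries are fused with, so fusing them with the current working-memory token is an assumption. The scalar sigmoid gate and the rule of merging the most similar adjacent pair when the bank is full are likewise assumptions. The names `retrieve`, `gated_fuse`, `consolidate`, and `max_len` are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(query, bank):
    """Scaled dot-product attention of one query token over all bank entries."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, entry) / scale for entry in bank])
    return [
        sum(w * entry[i] for w, entry in zip(weights, bank))
        for i in range(len(query))
    ]


def gated_fuse(current, retrieved):
    """Blend current and retrieved tokens with a scalar gate (assumed form)."""
    gate = 1.0 / (1.0 + math.exp(-dot(current, retrieved)))
    return [gate * r + (1.0 - gate) * c for c, r in zip(current, retrieved)]


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-8)


def consolidate(bank, token, max_len):
    """Append a token; if over capacity, average the most similar adjacent pair."""
    bank = bank + [token]  # copy, so the caller's list is not mutated
    if len(bank) > max_len:
        i = max(range(len(bank) - 1), key=lambda k: cosine(bank[k], bank[k + 1]))
        merged = [(x + y) / 2 for x, y in zip(bank[i], bank[i + 1])]
        bank = bank[:i] + [merged] + bank[i + 2:]
    return bank


# Toy run: 2-dimensional "tokens" stand in for real VLM embeddings.
memory_bank = [[1.0, 0.0], [0.0, 1.0]]
working_token = [1.0, 0.0]
context = retrieve(working_token, memory_bank)
fused = gated_fuse(working_token, context)
memory_bank = consolidate(memory_bank, fused, max_len=2)
print(len(memory_bank), fused)
```

In this sketch the bank never grows past `max_len`, because each insertion beyond capacity collapses one redundant pair. A real system would apply the same steps to batched tensors and would likely keep separate perceptual and cognitive streams, which the toy version does not model.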

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

* Clear structure with good motivation: handling non-Markovian tasks is an interesting and significant problem in robotics, and it motivates fusing a memory term into the architecture design.
* Extensive evaluation: the authors evaluate MemoryVLA across three different robots, three distinct simulation benchmarks (SimplerEnv-Bridge, SimplerEnv-Fractal, LIBERO), and a set of 12 real-world tasks. This comprehensive evaluation on 150+ tasks with 500+ variations provides high confidence in th

Weaknesses

* The ambiguity of the optimal memory length: the ablation study in Table 5 suggests that a memory length of $L=16$ is optimal (71.9% success), while performance worsens at $L=64$ (67.7%). However, in the Appendix, the authors state that a memory length of $L=256$ was used for real-world long-horizon tasks. The paper lacks an in-depth analysis of how memory length relates to actual performance.
* The mechanism of using single cognitive token: for complex tasks require multiple l

Reviewer 02 · Rating 2 · Confidence 5

Strengths

1. Incorporating memory mechanisms into VLAs is a highly relevant and important research direction.
2. The paper presents a large number of experiments, including those conducted on a real robot.
3. The work is well written and easy to read.

Weaknesses

1. Despite the large number of experiments, the main drawback of the paper is that most of the tasks used do not actually require a memory mechanism. The authors should conduct comparisons on specialized robotics benchmarks focused on memory-based tasks, such as Mikasa-Robo [1] and MemoryBench [2]. Without these experiments, it is impossible to properly evaluate the effectiveness of the proposed memory mechanism.
2. The results on LIBERO outperform Discrete Diffusion VLA [3] by only 0.3, even t

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The working-memory vs. long-term (episodic + semantic) split is directly inspired by human memory and mapped cleanly to a VLA stack (perceptual/cognitive tokens + Perceptual-Cognitive Memory Bank). This makes the temporal modeling choice easy to justify and reason about.
2. Converting observations into perceptual and cognitive tokens enables lightweight retrieval, fusion, and consolidation.
3. Good performance on the SimplerEnv-Bridge benchmark and LIBERO.

Weaknesses

1. Benchmark mismatch (memory not actually required). Fundamentally, the simulation benchmark used does not evaluate memory: the tasks appear in-distribution, short-horizon, and solvable without non-Markovian reasoning. I recommend evaluating on a benchmark that explicitly requires memory, such as Memory-Bench (from SAM2Act), to substantiate the paper’s claims.
2. Inadequate baselines (no memory or long-context retrieval). The chosen baselines are not memory-enhanced and do not leverage long co
