VPWEM: Non-Markovian Visuomotor Policy with Working and Episodic Memory
Yuheng Lei, Zhixuan Liang, Hongyuan Zhang, Ping Luo

TL;DR
VPWEM introduces a non-Markovian visuomotor policy with working and episodic memories, enabling robotic control systems to handle long-term dependencies efficiently and outperform existing methods in memory-intensive tasks.
Contribution
The paper presents VPWEM, a novel policy combining working and episodic memories with a Transformer-based compressor, addressing long-term memory challenges in robotic visuomotor tasks.
Findings
Outperforms state-of-the-art baselines by over 20% on manipulation tasks.
Achieves 5% improvement on the MoMaRT mobile manipulation benchmark.
Operates with nearly constant memory and computation per step.
Abstract
Imitation learning from human demonstrations has achieved significant success in robotic control, yet most visuomotor policies still condition on single-step observations or short-context histories, making them struggle with non-Markovian tasks that require long-term memory. Simply enlarging the context window incurs substantial computational and memory costs and encourages overfitting to spurious correlations, leading to catastrophic failures under distribution shift and violating real-time constraints in robotic systems. By contrast, humans can compress important past experiences into long-term memories and exploit them to solve tasks throughout their lifetime. In this paper, we propose VPWEM, a non-Markovian visuomotor policy equipped with working and episodic memories. VPWEM retains a sliding window of recent observation tokens as short-term working memory, and introduces a…
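The abstract is cut off before it explains how the episodic memory is built, so the following is only a minimal sketch of the bounded-memory idea, not the authors' method. It assumes that tokens leaving the sliding window are folded into a fixed number of episodic slots, and it uses a simple running mean where the paper uses a Transformer-based compressor. The class name `BoundedMemory`, its parameters, and the round-robin slot assignment are all hypothetical.

```python
from collections import deque


class BoundedMemory:
    """Hypothetical sketch: sliding-window working memory plus fixed-size episodic slots."""

    def __init__(self, window_size, num_slots, dim):
        if window_size < 1 or num_slots < 1:
            raise ValueError("window_size and num_slots must be at least 1")
        # Working memory: the most recent `window_size` observation tokens.
        self.working = deque(maxlen=window_size)
        # Episodic memory: a fixed number of slots, so storage never grows.
        self.episodic = [[0.0] * dim for _ in range(num_slots)]
        self.counts = [0] * num_slots
        self.evicted = 0

    def step(self, token):
        # If the window is full, compress the token about to be evicted.
        if len(self.working) == self.working.maxlen:
            self._compress(self.working[0])
        # Appending to a full deque drops the oldest token automatically.
        self.working.append(list(token))

    def _compress(self, token):
        # Assumption: round-robin slot choice with a running mean,
        # standing in for the paper's Transformer-based compressor.
        slot = self.evicted % len(self.episodic)
        self.counts[slot] += 1
        n = self.counts[slot]
        self.episodic[slot] = [
            m + (x - m) / n for m, x in zip(self.episodic[slot], token)
        ]
        self.evicted += 1

    def context(self):
        # The policy would condition on episodic slots followed by the window.
        return self.episodic + list(self.working)


mem = BoundedMemory(window_size=3, num_slots=2, dim=2)
for t in range(10):
    mem.step([float(t), 1.0])
print(len(mem.context()))  # context length stays at num_slots + window_size
```

However many steps are processed, the context holds `num_slots + window_size` tokens, which is the sense in which per-step memory and computation stay nearly constant.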
Taxonomy
Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Multimodal Machine Learning Applications
