Memory, Benchmark & Robots: A Benchmark for Solving Complex Tasks with Reinforcement Learning

Egor Cherepanov; Nikita Kachaev; Alexey K. Kovalev; Aleksandr I. Panov

arXiv:2502.10550·cs.LG·March 5, 2026

Open Access·1 Repo·1 Dataset·3 Reviews

TL;DR

This paper introduces MIKASA, a comprehensive benchmark suite designed to evaluate memory capabilities of reinforcement learning agents in complex, memory-intensive tasks, especially in robotic manipulation scenarios.

Contribution

The paper presents MIKASA, a new unified framework and benchmark suite for systematically assessing memory in RL agents across diverse scenarios, including robotic manipulation tasks.

Findings

1. MIKASA enables systematic evaluation of memory in RL agents.
2. The benchmark includes 32 carefully designed tasks for robotic manipulation.
3. MIKASA supports further research on memory-enhanced RL systems.

Abstract

Memory is crucial for enabling agents to tackle complex tasks with temporal and spatial dependencies. While many reinforcement learning (RL) algorithms incorporate memory, the field lacks a universal benchmark to assess an agent's memory capabilities across diverse scenarios. This gap is particularly evident in tabletop robotic manipulation, where memory is essential for solving tasks with partial observability and ensuring robust performance, yet no standardized benchmarks exist. To address this, we introduce MIKASA (Memory-Intensive Skills Assessment Suite for Agents), a comprehensive benchmark for memory RL, with three key contributions: (1) we propose a comprehensive classification framework for memory-intensive RL tasks, (2) we collect MIKASA-Base -- a unified benchmark that enables systematic evaluation of memory-enhanced agents across diverse scenarios, and (3) we develop…
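To make the abstract's point about partial observability concrete, here is a minimal sketch of a memory-intensive task. It is a hypothetical toy, not a MIKASA task and not the authors' code: the names `CueRecallEnv`, `memoryless_policy`, `memory_policy`, and `success_rate` are all assumptions made for illustration. A cue is visible only on the first step and must be reported after a delay, so a policy that acts only on the current observation cannot beat chance.

```python
import random


class CueRecallEnv:
    """Hypothetical toy task: a binary cue is visible only at step 0
    and must be reported on the final step."""

    def __init__(self, delay=5, seed=0):
        self.delay = delay
        self.rng = random.Random(seed)

    def reset(self):
        self.cue = self.rng.choice([0, 1])
        self.t = 0
        return self.cue  # the only observation that reveals the cue

    def step(self, action):
        self.t += 1
        done = self.t > self.delay
        # Reward is given only on the last step, if the action matches the cue.
        reward = 1.0 if done and action == self.cue else 0.0
        return -1, reward, done  # -1 is a blank observation


def memoryless_policy(obs, state):
    # Uses the current observation only; once the cue is gone it guesses 0.
    return (obs if obs in (0, 1) else 0), state


def memory_policy(obs, state):
    # Stores the cue while it is visible and replays it afterwards.
    if obs in (0, 1):
        state = obs
    return state, state


def success_rate(policy, episodes=200, delay=5, seed=0):
    env = CueRecallEnv(delay=delay, seed=seed)
    total = 0.0
    for _ in range(episodes):
        obs, state, done = env.reset(), None, False
        while not done:
            action, state = policy(obs, state)
            obs, reward, done = env.step(action)
            total += reward
    return total / episodes


if __name__ == "__main__":
    print("memoryless:", success_rate(memoryless_policy))
    print("with memory:", success_rate(memory_policy))
```

In this sketch the policy with memory always succeeds, while the memoryless one succeeds only when the hidden cue happens to match its fixed guess. The benchmark's actual tasks are robotic manipulation problems and are far richer than this example.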

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

1. The benchmark offers robust and comprehensive coverage, incorporating a diverse set of tasks that rigorously test models across a spectrum of memory requirements.
2. This benchmark successfully strikes a critical balance, bridging the gap between current, often over-simplified evaluation methods and the complexity of real-world, human-level tasks. By incorporating challenges that mirror the cognitive demands of human problem-solving while maintaining experimental tractability, it establishes

Weaknesses

1. The benchmark is specialized for tabletop robotic manipulation tasks. While this area is critical for memory in robotics, the benchmark does not cover other key areas of robotic memory, such as continual learning where the agent must adapt to changing task distributions or physical environments over extended, sequential periods.
2. By design, the tasks are complex and memory-intensive, leading to high computational demands for training. This can limit the accessibility of the benchmark to re

Reviewer 02·Rating 8·Confidence 4

Strengths

The proposed benchmark addresses spatial and temporal memory -- an important and timely topic in RL. The authors evaluate a broad set of baselines, and since none achieve strong performance across all tasks, the benchmark appears to have an appropriate level of difficulty, leaving room for future progress. It is built on the ManiSkill3 simulation framework, enabling fast GPU parallelisation -- a valuable design choice for scalability. Overall, the paper is well written and easy to follow, with o

Weaknesses

- The main text lacks details on the quality of the offline RL datasets. The authors should explicitly state that these are expert datasets, as this information is important for readers and enables potential use of imitation learning methods.
- TD-MPC2 is cited as an arXiv preprint, but it has been published at ICLR 2024.
- The bibliography has inconsistent capitalisation (e.g., “Td-mpc2” should be “TD-MPC2”). This can be fixed by correctly using braces {} in BibTeX.
- It is unusual to use "Subs

Reviewer 03·Rating 8·Confidence 3

Strengths

The paper is well-motivated, addressing an important gap in existing RL benchmarks. The suite of 32 tasks provides comprehensive coverage of different types of memory requirements in physical robotic tasks. The categorization of previous tasks and integration into a single framework is an important contribution that will facilitate further research. The authors evaluate both online and offline algorithms and motivate the need for both in memory-intensive tasks. The authors convi

Weaknesses

This is not the first benchmark for physical memory-intensive tasks. However, the authors acknowledge this limitation and clearly state that their benchmark captures more types of memory tasks than previous work. The connection to cognitive processes and memory mechanisms could be more fully developed.

Code & Models

Repositories

CognitiveAISystems/MIKASA-Robo
PyTorch·Official

Datasets

avanturist/mikasa-robo
dataset·1.5k dl

Taxonomy

Topics·Reinforcement Learning in Robotics