FindingDory: A Benchmark to Evaluate Memory in Embodied Agents
Karmesh Yadav, Yusuf Ali, Gunshi Gupta, Yarin Gal, Zsolt Kira

TL;DR
This paper introduces a new benchmark for evaluating long-term memory and reasoning in embodied agents within the Habitat simulator, addressing current limitations of vision-language models in long-horizon tasks.
Contribution
It presents a comprehensive benchmark with 60 tasks for long-range embodied reasoning, along with baselines integrating vision-language models and navigation policies.
Findings
Benchmark enables scalable evaluation of memory in embodied tasks.
Current models show room for improvement in long-term contextual reasoning.
Procedural extensions allow testing of more challenging memory-dependent scenarios.
Abstract
Large vision-language models have recently demonstrated impressive performance in planning and control tasks, driving interest in their application to real-world robotics. However, deploying these models for reasoning in embodied contexts is limited by their ability to incorporate long-term experience collected across multiple days and represented by vast collections of images. Current VLMs typically struggle to process more than a few hundred images concurrently, highlighting the need for more efficient mechanisms to handle long-term memory in embodied settings. To effectively evaluate these models for long-horizon control, a benchmark must specifically target scenarios where memory is crucial for success. Existing long-video QA benchmarks overlook embodied challenges like object manipulation and navigation, which demand low-level skills and fine-grained reasoning over past…
Peer Reviews
Decision·Submitted to ICLR 2026
- The manuscript is well-written and well-organized
- The manuscript considers compelling tasks and reasoning mechanisms in Embodied AI
- The manuscript provides a reasonable set of initial experiments that provide sufficient headroom for subsequent research
- The paper provides a good amount of experiments
The manuscript is missing a principled discussion of why the tasks were generated in the way that they were. Why were the memory tasks generated according to the templates in Section 3.1, specifically? How was it ensured in the task design that these tasks are meaningful in some way, e.g., resemble naturally-occurring tasks?
1. The design of navigation tasks emphasizes that agents rely exclusively on past interaction information for high-level goal selection. This explicitly assesses VLMs' memory retrieval capabilities across diverse dimensions, including single-target spatial tasks, single-target temporal tasks, and multi-target tasks. 2. The integration of high-level and low-level policies reveals that different low-level navigation skills lead to varying degrees of task completion when performing memory-based pra
Oracle agents were employed during memory collection, which appears to introduce several stringent assumptions: 1. The memory construction assumes that all consecutive subtasks are successfully completed, and each is executed in the most efficient manner, i.e., via the shortest path. The experiences collected through this method are excessively "clean" and inconsistent with real-world scenarios. This is because completing multiple subtasks is highly challenging—evidenced by a mere 26.4% success
(1) The paper addresses a clear and critical gap in current research: the lack of rigorous benchmarks for long-horizon memory in embodied agents. The authors convincingly argue why existing video QA and embodied QA benchmarks are insufficient. The problem it addresses is of great significance. (2) I think the two-phase setup that decouples experience collection from the evaluation phase is a very strong methodological choice. This design effectively isolates an agent's memory and reasoning
(1) My main reservation is that the evaluation is tightly coupled with a specific hierarchical policy (VLM for goal-frame selection + navigation controller). The paper itself shows in Figure that the performance of this system drops massively when the low-level navigation policy is introduced. This makes it difficult to disentangle the source of failure. For example, is a low success rate on a task due to the VLM's inability to recall the correct information, or is it because the VLM correctly i
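The disentanglement concern raised in this review can be made concrete with a small sketch. It assumes each evaluation episode logs two outcomes separately: whether the VLM selected a correct goal frame, and whether the navigation controller reached the selected goal. The `Episode` record, its field names, and the `attribute_failures` helper are hypothetical illustrations and are not taken from the paper or its code.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical per-episode log; field names are illustrative only.
    goal_frame_correct: bool  # did the VLM pick a frame showing the right target?
    nav_success: bool         # did the low-level controller reach the chosen goal?


def attribute_failures(episodes):
    """Return the fraction of episodes per outcome, split by the stage that failed first."""
    counts = Counter()
    for ep in episodes:
        if not ep.goal_frame_correct:
            # Wrong goal selected: count as a high-level (memory/reasoning) failure,
            # regardless of how navigation went afterwards.
            counts["high_level_failure"] += 1
        elif not ep.nav_success:
            # Right goal selected, but the controller did not reach it.
            counts["low_level_failure"] += 1
        else:
            counts["success"] += 1
    total = len(episodes)
    return {key: value / total for key, value in counts.items()} if total else {}


if __name__ == "__main__":
    log = [
        Episode(goal_frame_correct=True, nav_success=True),
        Episode(goal_frame_correct=True, nav_success=False),
        Episode(goal_frame_correct=False, nav_success=True),
        Episode(goal_frame_correct=False, nav_success=False),
    ]
    print(attribute_failures(log))
```

Reporting the two failure fractions side by side would show how much of an end-to-end drop comes from goal selection and how much from navigation, under the assumption that both signals are available per episode.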
Taxonomy
Topics: Reinforcement Learning in Robotics
