Memory Proxy Maps for Visual Navigation
Faith Johnson, Bryan Bo Cao, Ashwin Ashok, Shubham Jain, Kristin Dana

TL;DR
This paper introduces a novel visual navigation method using a memory proxy map and feudal learning, achieving state-of-the-art results without relying on RL, graphs, odometry, or metric maps.
Contribution
It presents a new no-RL, no-graph, no-odometry approach with a memory proxy map and hierarchical agents for improved visual navigation.
Findings
Achieves state-of-the-art performance on image goal navigation.
Removes the need for traditional maps and odometry in navigation.
Demonstrates effective self-supervised environment representation.
Abstract
Visual navigation takes inspiration from humans, who navigate in previously unseen environments using vision without detailed environment maps. Inspired by this, we introduce a novel no-RL, no-graph, no-odometry approach to visual navigation using feudal learning to build a three-tiered agent. Key to our approach is a memory proxy map (MPM), an intermediate representation of the environment learned in a self-supervised manner by the high-level manager agent that serves as a simplified memory, approximating what the agent has seen. We demonstrate that recording observations in this learned latent space is an effective and efficient memory proxy that can remove the need for graphs and odometry in visual navigation tasks. For the mid-level manager agent, we develop a waypoint network (WayNet) that outputs intermediate subgoals, or waypoints, imitating human waypoint selection during local…
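As a rough illustration of what "recording observations in a learned latent space" could look like, here is a minimal sketch. It is not the authors' implementation: the class `MemoryProxyMap`, the 2D latent coordinates in [0, 1]^2, the grid size, and the Gaussian width `sigma` are all assumptions made for illustration, and the learned encoder that would produce the latent points is left out entirely.

```python
import math


class MemoryProxyMap:
    """Toy 2D grid that accumulates where observations fall in a latent space.

    Hypothetical sketch: latent points are assumed to be given as (x, y)
    pairs in [0, 1]^2, standing in for the output of a learned encoder.
    """

    def __init__(self, size=32, sigma=1.5):
        self.size = size
        self.sigma = sigma
        self.grid = [[0.0] * size for _ in range(size)]

    def _to_cell(self, z):
        # Scale a latent point to continuous grid coordinates.
        return z[0] * (self.size - 1), z[1] * (self.size - 1)

    def record(self, z):
        """Add a Gaussian bump centred on the latent point z."""
        cx, cy = self._to_cell(z)
        for row in range(self.size):
            for col in range(self.size):
                d2 = (col - cx) ** 2 + (row - cy) ** 2
                self.grid[row][col] += math.exp(-d2 / (2 * self.sigma ** 2))

    def visited(self, z):
        """Return the accumulated value at the cell nearest to z."""
        cx, cy = self._to_cell(z)
        return self.grid[round(cy)][round(cx)]
```

In this sketch a high value from `visited` means the agent has already seen something that maps near that latent location, so no graph or odometry is consulted to answer "have I been here?".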
Peer Reviews
Decision·Submitted to ICLR 2025
1. The paper proposes a novel modular navigation method that does not rely on reinforcement learning, odometry, or metric maps. The memory proxy map is learned through the self-supervised SMoG method, which does not require human supervision or environmental odometry. 2. Extensive experiments on the Gibson dataset have demonstrated the SOTA performance of FeudalNav, compared with RL-based and graph-based baselines. The ablation study also verifies the effectiveness of modules (especially the MPM
The paper is poorly written and fails to deliver a convincing motivation or clear methodology. 1. The whole hierarchical method is not described formally, which makes multiple technical details unclear, including the momentum grouping model (line 158), visual similarity (line 161), confidence of keypoint matches (line 168), Gaussian window (line 180), WayNet (line 229), and so on. 2. The approach of imitating human waypoint generation through supervised learning is questionable. First, the state
1. The paper is easy to follow and the key ideas are explained clearly. 2. The idea of breaking down the task into separate task-specific components has benefits such as interpretability and improved sample efficiency. 3. The idea of building a latent memory representation using self-supervised learning is novel and an interesting contribution. 4. Similarly, the idea of training a waypoint network using the LAVN dataset is promising as it enables possibilities for training navigation agents without
1. One main concern I have with the paper is that the authors claim the method is SoTA on ImageNav while it is not. An end-to-end method presented in [1] last year achieved significant improvements in ImageNav and InstanceImageNav performance. Comparison with this method is missing in the paper. I would appreciate it if the authors could add a comparison or explain why comparison with this method wouldn’t be fair and update the claim in the paper accordingly. Additionally, comparison to another method
(1) The paper proposes a hierarchical and pure vision-based navigation framework. The proposed method achieves the best performance on the ImageNav benchmark without RL, explicit map, and odometry. This mapless navigation framework significantly reduces computational complexity for robot applications. (2) The proposed idea of using dimension-reduced feature space as a map (Memory Proxy Maps) is novel and lightweight compared with metric-based maps and topological graphs. (3) The experiment perfo
(1) The entire system is complicated, with 3 layers of decision modules, and the lack of necessary ablation studies (especially quantitative metrics) makes it difficult to comprehensively understand the functions of some proposed components. For example, Figure 3 provides a visualization of the difference between the WayNet prediction coordinates and the ground-truth coordinates. What is the average distance between the predictions and the labels across diverse scenes? Quantitative metrics should be re
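The average prediction-to-label distance this review asks about could be computed along the following lines. This is a minimal sketch, assuming waypoints are 2D coordinates (for example, pixel positions); the function name `mean_waypoint_error` is hypothetical and nothing here comes from the paper's code.

```python
import math


def mean_waypoint_error(predicted, ground_truth):
    """Mean Euclidean distance between paired predicted and labelled waypoints.

    Both arguments are assumed to be equal-length sequences of (x, y) pairs.
    """
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)
```

Reporting this value per scene, alongside the qualitative figure, would give the kind of quantitative comparison the reviewer requests.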
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Constraint Satisfaction and Optimization · Data Management and Algorithms
