DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation
Tianjun Gu, Linfeng Li, Xuhong Wang, Chenghua Gong, Jingyu Gong, Zhizhong Zhang, Yuan Xie, Lizhuang Ma, Xin Tan

TL;DR
DORAEMON is a novel, human-inspired framework for autonomous robot navigation that combines high-level scene understanding with low-level path planning, achieving state-of-the-art results without prior map knowledge.
Contribution
It introduces a decentralized, ontology-aware architecture with hierarchical semantic-spatial fusion and enhanced memory, advancing zero-shot navigation in complex environments.
Findings
Achieves state-of-the-art success rate and efficiency metrics.
Outperforms existing methods on multiple benchmark datasets.
Introduces a new metric for navigation intelligence evaluation.
Abstract
Adaptive navigation in unfamiliar environments is crucial for household service robots but remains challenging due to the need for both low-level path planning and high-level scene understanding. While recent vision-language model (VLM) based zero-shot approaches reduce dependence on prior maps and scene-specific training data, they face significant limitations: spatiotemporal discontinuity from discrete observations, unstructured memory representations, and insufficient task understanding leading to navigation failures. We propose DORAEMON (Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation), a novel cognitive-inspired framework consisting of Ventral and Dorsal Streams that mimics human navigation capabilities. The Dorsal Stream implements the Hierarchical Semantic-Spatial Fusion and Topology Map to handle spatiotemporal discontinuities, while the…
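As a rough illustration of the kind of topological memory the abstract refers to, the sketch below keeps a graph of visited places and merges a new observation into an existing node when it falls close enough to one. Everything in it is an assumption made for illustration: the names `TopoNode`, `TopoMap`, `observe`, and `merge_radius`, the distance-threshold merge rule, and the running-average position update are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set, Tuple

Point = Tuple[float, float]


@dataclass
class TopoNode:
    """One place in the map (hypothetical structure, not the paper's)."""
    node_id: int
    position: Point
    labels: Set[str] = field(default_factory=set)
    visits: int = 1


class TopoMap:
    """Toy topological memory: nodes are places, edges are traversals."""

    def __init__(self, merge_radius: float = 1.0) -> None:
        self.merge_radius = merge_radius  # assumed merge threshold (metres)
        self.nodes: Dict[int, TopoNode] = {}
        self.edges: Set[Tuple[int, int]] = set()
        self._last_id: Optional[int] = None

    def _nearest(self, position: Point) -> Tuple[Optional[TopoNode], float]:
        """Return the closest existing node and its Euclidean distance."""
        best, best_dist = None, math.inf
        for node in self.nodes.values():
            dist = math.dist(node.position, position)
            if dist < best_dist:
                best, best_dist = node, dist
        return best, best_dist

    def observe(self, position: Point, labels: Iterable[str]) -> int:
        """Record an observation, merging into a nearby node if one exists."""
        node, dist = self._nearest(position)
        if node is not None and dist <= self.merge_radius:
            # Merge: running-average position, union of semantic labels.
            n = node.visits
            node.position = tuple(
                (old * n + new) / (n + 1)
                for old, new in zip(node.position, position)
            )
            node.labels |= set(labels)
            node.visits += 1
        else:
            node = TopoNode(len(self.nodes), tuple(position), set(labels))
            self.nodes[node.node_id] = node
        # Link consecutive distinct places with an undirected edge.
        if self._last_id is not None and self._last_id != node.node_id:
            self.edges.add(tuple(sorted((self._last_id, node.node_id))))
        self._last_id = node.node_id
        return node.node_id
```

Calling `observe` once per step with the agent's estimated position and the detected object labels yields a compact graph rather than a list of disconnected frames, which is the general idea behind using a topological map to counter spatiotemporal discontinuity.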
Peer Reviews
Decision: Submitted to ICLR 2026
### 1. Innovative Cognitive-Inspired Architecture
DORAEMON mimics human "dorsal (spatial)-ventral (semantic)" dual pathways to address core VLM navigation flaws: the Dorsal Stream resolves spatiotemporal discontinuity via a hierarchical Topology Map and semantic-spatial fusion, while the Ventral Stream (CoDe-VLM + Exec-VLM) converts unstructured tasks into knowledge graphs for interpretable reasoning, outperforming baselines with fragmented memory.
### 2. End-to-End & Zero-Shot Capability
### 1. Limited Novelty in Topological Map Representation
The work’s use of a topological map for spatial memory lacks strong originality, as topological structures for robot navigation have been extensively explored in prior studies. For instance, recent work (e.g., Mem4Nav, TopoNav) already integrated semantic topological graphs with spatial memory to address navigation continuity. While this paper optimizes node updating/merging, these are incremental tweaks to existing topological paradigms
The dual-stream structure (ventral/dorsal) offers a biologically motivated yet technically meaningful decomposition of perception and reasoning. The integration of semantic graphs and spatial topology is novel for zero-shot navigation. The AORI metric provides a more nuanced measurement of spatial redundancy than SPL/SR, aligning well with embodied intelligence evaluation trends.
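For context on the baseline metrics named in this review, the sketch below computes SR (success rate) and SPL (success weighted by path length) in their standard form, where each successful episode contributes the ratio of the shortest-path length to the longer of the shortest-path length and the path the agent actually took. The `Episode` container and its field names are assumptions for illustration. AORI is not implemented because its definition is not given in this excerpt.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One navigation episode (hypothetical container for illustration)."""
    success: bool
    shortest_path: float  # geodesic distance from start to goal
    agent_path: float     # length of the path the agent actually took


def success_rate(episodes: Sequence[Episode]) -> float:
    """SR: fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """SPL: mean over episodes of success * l / max(p, l)."""
    total = 0.0
    for e in episodes:
        denom = max(e.agent_path, e.shortest_path)
        if e.success and denom > 0:
            total += e.shortest_path / denom
    return total / len(episodes)


episodes = [
    Episode(success=True, shortest_path=5.0, agent_path=10.0),
    Episode(success=True, shortest_path=4.0, agent_path=4.0),
    Episode(success=False, shortest_path=3.0, agent_path=9.0),
    Episode(success=True, shortest_path=2.0, agent_path=8.0),
]
print(success_rate(episodes))  # 0.75
print(spl(episodes))           # 0.4375
```

In this toy data, SR counts three of four episodes as equally successful, while SPL discounts the two that took detours. Neither metric looks at where the extra path length was spent, which is the gap a redundancy-oriented metric such as AORI is described as addressing.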
Distilling human reasoning ability into navigation tasks is not a new direction, as many works have already explored this idea through various approaches such as NavGPT[1], NavigateDiff[2], and Navid[3]. In addition, numerous studies have leveraged large language models for navigation, which the authors should review more comprehensively and discuss in greater depth. Some parts of the paper are written rather roughly. For example, the **caption of Figure 3** should include a detailed explanatio
- The paper presents a well-engineered navigation framework that combines spatial memory and VLM-based semantic reasoning into a cohesive system.
- Strong empirical results are demonstrated on multiple established datasets, showing improvements over recent zero-shot and end-to-end baselines.
- The hierarchical memory design is intuitive and may help support longer-horizon reasoning.
- Practical concerns such as getting stuck or failing to stop correctly are explicitly addressed through the Na
- **Weak novelty; overly conceptual framing** The cognitive metaphors (decentralized ontology, ventral/dorsal streams) provide an interesting narrative direction, but their influence on the actual technical design remains unclear. Many of the core components, such as hierarchical spatial memory, semantic graph–based task representation, and VLM-guided action reasoning, are already present in recent zero-shot navigation systems. It would strengthen the contribution to more explicitly identify wha
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Social Robot Interaction and HRI
