CogniMap3D: Cognitive 3D Mapping and Rapid Retrieval
Feiran Wang, Junyi Wu, Dawen Cai, Yuan Hong, Yan Yan

TL;DR
CogniMap3D is a bioinspired framework that enhances 3D scene understanding by mimicking human cognition, enabling efficient dynamic scene reconstruction, memory-based retrieval, and pose refinement across extended sequences.
Contribution
It introduces a novel cognitive mapping system with multi-stage motion cues and factor graph optimization for dynamic 3D scene understanding and memory management.
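To make the factor graph idea concrete, here is a minimal sketch of pose refinement on a toy one-dimensional trajectory. It is not the paper's optimizer: the function name `optimize_pose_graph_1d`, the factor tuple layout, and the use of a dense linear least-squares solve are all assumptions made for illustration. Each factor says "pose j minus pose i should equal a measured offset", and a prior on pose 0 fixes the otherwise free global shift.

```python
import numpy as np

def optimize_pose_graph_1d(num_poses, factors, prior=0.0):
    """Solve a linear 1D pose graph by least squares (illustrative only).

    factors: list of (i, j, measured_offset, weight) tuples, each encoding
    the constraint x[j] - x[i] ~= measured_offset. `weight` scales the
    residual, so it acts as a square-root information weight.
    """
    rows, rhs = [], []

    # Prior factor: anchor pose 0 so the solution has no free global shift.
    row = np.zeros(num_poses)
    row[0] = 1.0
    rows.append(row)
    rhs.append(prior)

    # Relative factors: odometry between neighbours, or loop closures
    # between distant poses that observe the same place.
    for i, j, offset, weight in factors:
        row = np.zeros(num_poses)
        row[i] = -weight
        row[j] = weight
        rows.append(row)
        rhs.append(weight * offset)

    A = np.vstack(rows)
    b = np.asarray(rhs, dtype=np.float64)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

# Three odometry steps of 1.0 each, plus one loop closure claiming that
# pose 3 is only 2.7 away from pose 0.
factors = [(0, 1, 1.0, 1.0), (1, 2, 1.0, 1.0), (2, 3, 1.0, 1.0), (0, 3, 2.7, 1.0)]
poses = optimize_pose_graph_1d(4, factors)
```

The loop closure disagrees with the accumulated odometry, so the solver spreads the discrepancy evenly over the three steps. A real system would use SE(3) poses and a nonlinear solver, but the structure of the problem (variables linked by weighted residual factors) is the same.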
Findings
State-of-the-art performance in video depth estimation
Accurate camera pose reconstruction across sequences
Effective continuous scene understanding and memory updating
Abstract
We present CogniMap3D, a bioinspired framework for dynamic 3D scene understanding and reconstruction that emulates human cognitive processes. Our approach maintains a persistent memory bank of static scenes, enabling efficient spatial knowledge storage and rapid retrieval. CogniMap3D integrates three core capabilities: a multi-stage motion cue framework for identifying dynamic objects, a cognitive mapping system for storing, recalling, and updating static scenes across multiple visits, and a factor graph optimization strategy for refining camera poses. Given an image stream, our model identifies dynamic regions through motion cues with depth and camera pose priors, then matches static elements against its memory bank. When revisiting familiar locations, CogniMap3D retrieves stored scenes, relocates cameras, and updates memory with new observations. Evaluations on video depth estimation,…
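One common way to turn depth and camera pose priors into a motion cue is to compare the observed optical flow with the flow that camera motion alone would produce in a static scene. The sketch below illustrates that general idea only; it is not the paper's multi-stage pipeline. The function names, the pinhole intrinsics `K`, the convention that `R, t` map frame-1 camera coordinates into frame 2, and the pixel threshold are all assumptions.

```python
import numpy as np

def ego_motion_flow(depth, K, R, t):
    """Flow (in pixels) that a static scene would show under camera motion.

    depth: (H, W) depth map of frame 1.
    K:     (3, 3) pinhole intrinsics (assumed shared by both frames).
    R, t:  rotation (3, 3) and translation (3,) taking frame-1 camera
           coordinates to frame-2 camera coordinates (assumed convention).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)

    rays = pix @ np.linalg.inv(K).T       # back-project pixels to unit-depth rays
    pts1 = rays * depth[..., None]        # 3D points in frame 1
    pts2 = pts1 @ R.T + t                 # same points in frame 2
    proj = pts2 @ K.T                     # project into the frame-2 image
    uv2 = proj[..., :2] / proj[..., 2:3]
    return uv2 - pix[..., :2]

def dynamic_mask(observed_flow, depth, K, R, t, thresh_px=1.5):
    """Mark pixels whose observed flow deviates from the ego-motion flow."""
    residual = np.linalg.norm(
        observed_flow - ego_motion_flow(depth, K, R, t), axis=-1
    )
    return residual > thresh_px

# Synthetic check: a flat scene at depth 2, camera translating along x,
# with a small patch given extra motion to imitate a moving object.
H, W = 24, 32
depth = np.full((H, W), 2.0)
K = np.array([[100.0, 0.0, 16.0], [0.0, 100.0, 12.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.1, 0.0, 0.0])
static_flow = ego_motion_flow(depth, K, R, t)
observed = static_flow.copy()
observed[5:10, 8:12, 0] += 4.0
mask = dynamic_mask(observed, depth, K, R, t)
```

In practice the observed flow would come from an optical flow network and the depth and pose from a learned prior, so errors in either prior show up directly in the residual. That dependence is the weakness the reviewers raise.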
Peer Reviews
Decision: ICLR 2026 Poster
1. Conceptual originality. The paper introduces a cognitively inspired formulation of 3D scene understanding that explicitly models long-term memory, recall, and update—bridging human cognitive mapping theories with modern video foundation models. This conceptual framing is both novel and timely for the community’s growing interest in continual, memory-based perception. 2. Technical coherence. The pipeline is well structured: multi-stage motion cues, a dual-modality memory, and factor-graph opti
While the paper is conceptually strong and empirically well-supported, several technical limitations remain that constrain its general applicability and robustness. 1. The proposed multi-stage motion cue framework heavily relies on the accuracy of the underlying Video Foundation Model (VFM), particularly the depth and pose priors obtained from VGGT. Since these priors are directly used to compute geometric residuals and dynamic masks, any failure of the VFM in low-texture or high-illumination-v
1. Systematically introducing the concept of "cognitive memory" into dynamic scene reconstruction is a novel and visionary attempt, which directly addresses the key challenge of transitioning from processing isolated video clips to achieving long-term, persistent environmental perception. 2. The paper is clearly articulated, effectively conveying its complex system architecture and core ideas through high-quality illustrations, enabling readers to clearly understand its workflow and contribution
1. Table 1 reports 14.32 FPS, which is surprisingly high given the complexity of the whole system (integrating VGGT, RAFT, DINOv2, PointNet++, SAM2, etc.). The authors should state more clearly which modules this FPS measurement covers. 2. The multi-stage, cascaded architecture of the framework raises a key concern: small biases in upstream modules may be amplified later. The authors should briefly discuss the robustness of the system to initi
1. This paper provides clear writing and figures. The dynamic mask pipeline is easy to follow. 2. Geometry and flow cues plus global mean move refine masks and stabilize tracking. The results showed that 3D reconstruction is more stable than baselines in metrics and visualization. 3. Using 2D/3D features supports fast indexing, and voting + ICP verifies matches. The memory further supplies constraints that help the global optimization of camera trajectories.
1. VGGT comparison gap. The model initializes pose and depth with VGGT, yet the visualizations lack direct, controlled comparisons against VGGT; the reconstruction metrics are also close. Since VGGT is not specifically optimized for dynamic scenes, the authors should include more dynamic-motion benchmarks and report VGGT versus the proposed method to substantiate the value of dynamic-mask extraction. 2. Lack of evidence for the memory module. While the paper states that memory provides stronger constraints for trajectory
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization
