TL;DR
This paper introduces a map-free method for 3D object localization that relies on storing posed RGB-D keyframes, enabling faster scene indexing and effective robot navigation without dense 3D reconstruction.
Contribution
It proposes a lightweight visual memory approach that bypasses traditional 3D scene reconstruction, reducing preprocessing time and storage while maintaining strong localization performance.
Findings
Scene indexing is over 100 times faster than reconstruction-based methods.
The approach achieves competitive object localization accuracy without dense 3D models.
The method performs well on downstream object-goal navigation benchmarks.
Abstract
Target localization is a prerequisite for embodied tasks such as navigation and manipulation. Conventional approaches rely on constructing explicit 3D scene representations to enable target localization, such as point clouds, voxel grids, or scene graphs. While effective, these pipelines incur substantial mapping time, storage overhead, and scalability limitations. Recent advances in vision-language models suggest that rich semantic reasoning can be performed directly on 2D observations, raising a fundamental question: is a complete 3D scene reconstruction necessary for object localization? In this work, we revisit object localization and propose a map-free pipeline that stores only posed RGB-D keyframes as a lightweight visual memory, without constructing any global 3D representation of the scene. At query time, our method retrieves candidate views, re-ranks them with a vision-language…
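To make the idea of a keyframe-only visual memory concrete, here is a minimal Python sketch, not the paper's implementation. It assumes each keyframe carries a precomputed embedding vector (hypothetical; the abstract does not say how views are indexed), retrieves candidate views by cosine similarity, and lifts a single pixel to a 3D world point using the keyframe's depth and camera-to-world pose with a pinhole intrinsics matrix `K`. The names `Keyframe`, `KeyframeMemory`, `retrieve`, and `backproject` are illustrative, and the vision-language re-ranking step is omitted.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Keyframe:
    rgb: np.ndarray        # (H, W, 3) color image
    depth: np.ndarray      # (H, W) depth in meters
    pose: np.ndarray       # (4, 4) camera-to-world transform
    embedding: np.ndarray  # (D,) image embedding (assumed, hypothetical)


class KeyframeMemory:
    """Stores posed RGB-D keyframes only; no global 3D map is built."""

    def __init__(self):
        self.keyframes = []

    def add(self, keyframe):
        self.keyframes.append(keyframe)

    def retrieve(self, query_embedding, k=3):
        """Return (index, cosine score) for the k most similar keyframes."""
        if not self.keyframes:
            return []
        embs = np.stack([kf.embedding for kf in self.keyframes]).astype(float)
        embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
        q = np.asarray(query_embedding, dtype=float)
        q = q / np.linalg.norm(q)
        scores = embs @ q
        order = np.argsort(-scores)[:k]
        return [(int(i), float(scores[i])) for i in order]


def backproject(keyframe, u, v, K):
    """Lift pixel (u, v) to a 3D world point (pinhole model, assumed)."""
    z = float(keyframe.depth[v, u])
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    point_cam = np.array([x, y, z, 1.0])
    return (keyframe.pose @ point_cam)[:3]
```

In this sketch, indexing a scene amounts to appending keyframes, which illustrates why such a memory can be much cheaper to build than a dense reconstruction. A 3D location is computed only for the pixel of interest in a retrieved view.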
