RoboRetriever: Single-Camera Robot Object Retrieval via Active and Interactive Perception with Dynamic Scene Graph
Hecheng Wang, Jiankun Ren, Jia Yu, Lizhe Qi, Yunquan Sun

TL;DR
RoboRetriever is a novel framework enabling real-world object retrieval using a single wrist-mounted RGB-D camera, combining dynamic scene graph reasoning, active perception, and manipulation to operate effectively in cluttered environments.
Contribution
It introduces a dynamic hierarchical scene graph and a visual prompting scheme for active perception, allowing single-camera robotic retrieval with natural language instructions.
Findings
Effective in cluttered, real-world scenes
Operates with only a wrist-mounted RGB-D camera
Remains robust under human intervention
Abstract
Humans effortlessly retrieve objects in cluttered, partially observable environments by combining visual reasoning, active viewpoint adjustment, and physical interaction, with only a single pair of eyes. In contrast, most existing robotic systems rely on carefully positioned fixed or multi-camera setups with complete scene visibility, which limits adaptability and incurs high hardware costs. We present RoboRetriever, a novel framework for real-world object retrieval that operates using only a single wrist-mounted RGB-D camera and free-form natural language instructions. RoboRetriever grounds visual observations to build and update a dynamic hierarchical scene graph that encodes object semantics, geometry, and inter-object relations over time. The supervisor module reasons over this memory and task instruction to infer the target object and coordinate an…
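To make the scene-graph idea concrete, here is a minimal Python sketch of a graph that stores object semantics, geometry, and inter-object relations and is updated over time. It is an illustration under stated assumptions, not the authors' implementation: the class names (ObjectNode, SceneGraph), the fields, the relation predicates ("on", "in_front_of"), and the blockers query are all hypothetical, since the abstract does not specify the schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ObjectNode:
    """One object in the graph. All field names are illustrative assumptions."""
    name: str                     # semantic label, e.g. "mug"
    position: tuple               # (x, y, z) centroid, assumed robot base frame
    last_seen: int                # observation step of the latest sighting
    parent: Optional[str] = None  # containing or supporting object, if any


@dataclass
class SceneGraph:
    """Hypothetical hierarchical scene graph updated one observation at a time."""
    nodes: dict = field(default_factory=dict)
    relations: set = field(default_factory=set)  # (subject, predicate, object)
    step: int = 0

    def update(self, detections, relations=()):
        """Merge one observation: insert or refresh nodes, then add relations."""
        self.step += 1
        for name, position, parent in detections:
            self.nodes[name] = ObjectNode(name, position, self.step, parent)
        self.relations.update(relations)

    def remove(self, name):
        """Drop an object (e.g. after it is picked up) and its relations."""
        self.nodes.pop(name, None)
        self.relations = {r for r in self.relations if name not in (r[0], r[2])}

    def ancestors(self, name):
        """Walk the hierarchy upward, e.g. mug -> box -> drawer."""
        chain = []
        node = self.nodes.get(name)
        while node and node.parent and node.parent not in chain:
            chain.append(node.parent)
            node = self.nodes.get(node.parent)
        return chain

    def blockers(self, target):
        """Objects recorded as lying on, or in front of, the target."""
        return sorted(
            s for s, p, o in self.relations
            if o == target and p in {"on", "in_front_of"}
        )


# Two observations from a moving camera: first the container, then its contents.
graph = SceneGraph()
graph.update([("drawer", (0.5, 0.0, 0.30), None),
              ("box", (0.5, 0.0, 0.32), "drawer")])
graph.update([("mug", (0.5, 0.0, 0.33), "box"),
              ("towel", (0.5, 0.0, 0.36), "box")],
             relations=[("towel", "on", "mug")])

before = graph.blockers("mug")   # objects to clear before grasping the mug
graph.remove("towel")            # the towel has been moved away
after = graph.blockers("mug")
```

In this sketch, a planner could query blockers to decide whether an interaction (moving the towel) is needed before grasping, and ancestors to decide which container to open or look into first. How RoboRetriever's supervisor actually queries its memory is not described in the text above.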
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Image Retrieval and Classification Techniques
