Language-Conditioned Observation Models for Visual Object Search
Thao Nguyen, Vladislav Hrosinkov, Eric Rosen, Stefanie Tellex

TL;DR
This paper introduces a language-conditioned observation model for robotic visual object search, enabling robots to understand complex language descriptions and adapt their detection strategies dynamically, improving search success rates.
Contribution
The work presents a novel neural network-based observation model that conditions object detection and noise modeling on language descriptions, allowing flexible and scalable object search.
Findings
Significantly improved the task completion rate in simulation, from 0.46 to 0.66.
Demonstrated successful real-world deployment on a Boston Dynamics Spot robot.
Outperformed fixed-noise models in efficiency and speed of object search.
Abstract
Object search is a challenging task because when given complex language descriptions (e.g., "find the white cup on the table"), the robot must move its camera through the environment and recognize the described object. Previous works map language descriptions to a set of fixed object detectors with predetermined noise models, but these approaches are challenging to scale because new detectors need to be made for each object. In this work, we bridge the gap in realistic object search by posing the search problem as a partially observable Markov decision process (POMDP) where the object detector and visual sensor noise in the observation model are determined by a single Deep Neural Network conditioned on complex language descriptions. We incorporate the neural network's outputs into our language-conditioned observation model (LCOM) to represent dynamically changing sensor noise. With an…
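To make the idea of dynamically changing sensor noise concrete, here is a minimal sketch of a single Bayesian belief update in which the observation model's true-positive and false-positive rates depend on a detector confidence score. It is an illustration under stated assumptions, not the authors' implementation: the function names, the four-cell world, and the linear mapping from confidence to noise rates are all hypothetical.

```python
import numpy as np


def noise_from_confidence(confidence):
    """Hypothetical mapping from a detector confidence in [0, 1] to
    (true-positive rate, false-positive rate). Assumed for illustration;
    this is not the mapping used in the paper."""
    c = float(np.clip(confidence, 0.0, 1.0))
    tp = 0.5 + 0.45 * c  # higher confidence -> detections are trusted more
    fp = 0.5 - 0.45 * c  # higher confidence -> fewer assumed false alarms
    return tp, fp


def update_belief(belief, in_fov, detected, confidence):
    """One Bayes update of a belief over candidate target cells.

    belief:     1-D array, prior probability that the target is in each cell
    in_fov:     boolean array, True for cells inside the camera's field of view
    detected:   whether the detector fired on this frame
    confidence: detector confidence for this frame (drives the noise rates)
    """
    tp, fp = noise_from_confidence(confidence)
    # P(detector fires | target in cell): tp if the cell is visible, else fp.
    p_fire = np.where(in_fov, tp, fp)
    likelihood = p_fire if detected else 1.0 - p_fire
    posterior = belief * likelihood
    return posterior / posterior.sum()


# Uniform prior over four cells; the camera sees the first two.
prior = np.full(4, 0.25)
in_fov = np.array([True, True, False, False])

low_conf = update_belief(prior, in_fov, detected=True, confidence=0.5)
high_conf = update_belief(prior, in_fov, detected=True, confidence=0.9)
print(low_conf, high_conf)
```

In this sketch, a high-confidence detection shifts more probability mass onto the visible cells than a low-confidence one, whereas a fixed-noise model would apply the same update regardless of confidence.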
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
