LEGS-POMDP: Language and Gesture-Guided Object Search in Partially Observable Environments
Ivy Xiao He, Stefanie Tellex, Jason Xinyu Liu

TL;DR
LEGS-POMDP is a modular system that combines language, gesture, and visual data within a POMDP framework to improve open-world object search under uncertainty, demonstrated through simulation and real-world experiments.
Contribution
The paper introduces a POMDP-based approach that explicitly models uncertainty over both the target object's identity and its location, and integrates language, gesture, and visual inputs for robust open-world object search.
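To make the idea of a belief over both identity and location concrete, here is a minimal sketch of a Bayesian belief update on a joint (object, grid cell) distribution. It is an illustration only, not the authors' implementation: the grid discretization, the function names `normalize` and `update_belief`, the likelihood values, and the assumption that language and gesture observations are conditionally independent given the true state are all hypothetical choices made for this example.

```python
import numpy as np


def normalize(p):
    """Rescale a non-negative array so that it sums to 1."""
    total = p.sum()
    if total <= 0:
        raise ValueError("belief has zero probability mass")
    return p / total


def update_belief(belief, language_lik, gesture_lik, vision_lik=None):
    """One Bayesian update of a joint belief over (identity, location).

    belief:       (n_objects, n_cells) prior over which object is the target
                  and which grid cell it occupies.
    language_lik: (n_objects,) likelihood of the utterance for each identity.
    gesture_lik:  (n_cells,) likelihood of the gesture for each location.
    vision_lik:   optional (n_objects, n_cells) detector likelihood.

    ASSUMPTION (ours, for illustration): the modalities are conditionally
    independent given the state, so their likelihoods simply multiply.
    """
    posterior = belief * language_lik[:, None] * gesture_lik[None, :]
    if vision_lik is not None:
        posterior = posterior * vision_lik
    return normalize(posterior)


# Toy example with made-up numbers: 3 candidate objects, 4 grid cells.
n_objects, n_cells = 3, 4
prior = np.full((n_objects, n_cells), 1.0 / (n_objects * n_cells))
language_lik = np.array([0.7, 0.2, 0.1])      # utterance favors object 0
gesture_lik = np.array([0.1, 0.6, 0.2, 0.1])  # pointing favors cell 1

posterior = update_belief(prior, language_lik, gesture_lik)
identity_marginal = posterior.sum(axis=1)  # belief over target identity
location_marginal = posterior.sum(axis=0)  # belief over target location
best = tuple(int(i) for i in np.unravel_index(posterior.argmax(), posterior.shape))
```

In this toy setup, language alone leaves the location uniform and gesture alone leaves the identity uniform. Multiplying both likelihoods into the same belief concentrates mass on a single (object, cell) pair, which is the intuition behind fusing modalities in one belief state.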
Findings
Achieves an 89% success rate in simulation.
Multimodal fusion outperforms unimodal baselines.
The system is validated on a quadruped robot in real-world scenarios.
Abstract
To assist humans in open-world environments, robots must interpret ambiguous instructions to locate desired objects. Foundation model-based approaches excel at multimodal grounding, but they lack a principled mechanism for modeling uncertainty in long-horizon tasks. In contrast, Partially Observable Markov Decision Processes (POMDPs) provide a systematic framework for planning under uncertainty but are often limited in supported modalities and rely on restrictive environment assumptions. We introduce LanguagE and Gesture-Guided Object Search in Partially Observable Environments (LEGS-POMDP), a modular POMDP system that integrates language, gesture, and visual observations for open-world object search. Unlike prior work, LEGS-POMDP explicitly models two sources of partial observability: uncertainty over the target object's identity and its spatial location. In simulation, multimodal…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robot Manipulation and Learning
