MLLM-Search: A Zero-Shot Approach to Finding People using Multimodal Large Language Models
Angus Fung, Aaron Hao Tan, Haitong Wang, Beno Benhabib, Goldie Nejat

TL;DR
MLLM-Search introduces a zero-shot, multimodal large language model-based architecture for autonomous robot search of people in dynamic, real-world environments, leveraging spatial understanding and semantic reasoning.
Contribution
It presents a novel visual prompting method and spatial chain-of-thought prompting that let a robot search for a person without prior knowledge of that person's schedule, plans or location.
Findings
Outperforms existing search methods in efficiency
Successfully generalizes to unseen environments
Validated through extensive 3D and real-world experiments
Abstract
Robotic search of people in human-centered environments, including healthcare settings, is challenging as autonomous robots need to locate people without complete or any prior knowledge of their schedules, plans or locations. Furthermore, robots need to be able to adapt to real-time events that can influence a person's plan in an environment. In this paper, we present MLLM-Search, a novel zero-shot person search architecture that leverages multimodal large language models (MLLMs) to address the mobile robot problem of searching for a person under event-driven scenarios with varying user schedules. Our approach introduces a novel visual prompting method to provide robots with spatial understanding of the environment by generating a spatially grounded waypoint map, representing navigable waypoints by a topological graph and regions by semantic labels. This is incorporated into an MLLM with…
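To make the idea of a spatially grounded waypoint map more concrete, the following is a minimal Python sketch of one way such a structure could be represented: waypoints as nodes of an undirected topological graph, each tagged with a semantic region label. It is not the authors' implementation. The class `WaypointMap`, its methods, the region names, and the text serialization in `describe` are all hypothetical and chosen only for illustration; the paper describes a visual prompting method, whereas this sketch only shows the underlying graph-plus-labels data structure.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WaypointMap:
    """Hypothetical container: waypoints are graph nodes, regions are labels."""

    positions: dict[int, tuple[float, float]] = field(default_factory=dict)
    regions: dict[int, str] = field(default_factory=dict)
    edges: dict[int, set[int]] = field(default_factory=dict)

    def add_waypoint(self, wp_id: int, xy: tuple[float, float], region: str) -> None:
        """Register a navigable waypoint with its position and semantic region."""
        self.positions[wp_id] = xy
        self.regions[wp_id] = region
        self.edges.setdefault(wp_id, set())

    def connect(self, a: int, b: int) -> None:
        """Add an undirected edge: the robot can travel between a and b."""
        if a not in self.positions or b not in self.positions:
            raise KeyError("both waypoints must be added before connecting them")
        self.edges[a].add(b)
        self.edges[b].add(a)

    def neighbors(self, wp_id: int) -> list[int]:
        """Waypoints directly reachable from wp_id, in ascending id order."""
        return sorted(self.edges[wp_id])

    def waypoints_in(self, region: str) -> list[int]:
        """All waypoint ids whose semantic label equals `region`."""
        return sorted(w for w, r in self.regions.items() if r == region)

    def describe(self) -> str:
        """Serialize the map as plain text (an assumed format, one line per waypoint)."""
        lines = []
        for wp in sorted(self.positions):
            x, y = self.positions[wp]
            adjacent = ", ".join(str(n) for n in self.neighbors(wp)) or "none"
            lines.append(
                f"waypoint {wp} at ({x:.1f}, {y:.1f}) in {self.regions[wp]}; "
                f"connects to: {adjacent}"
            )
        return "\n".join(lines)


def build_example_map() -> WaypointMap:
    """Build a tiny illustrative map; the layout and labels are made up."""
    wmap = WaypointMap()
    wmap.add_waypoint(0, (0.0, 0.0), "hallway")
    wmap.add_waypoint(1, (3.0, 0.0), "kitchen")
    wmap.add_waypoint(2, (0.0, 4.0), "lounge")
    wmap.connect(0, 1)
    wmap.connect(0, 2)
    return wmap


if __name__ == "__main__":
    print(build_example_map().describe())
```

Keeping the graph connectivity separate from the region labels means a planner could ask "which waypoints are in the kitchen?" and "what can I reach from here?" independently, which is one plausible reading of how a topological graph and semantic labels complement each other.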
Taxonomy
Topics: Computational and Text Analysis Methods · Topic Modeling
