GSON: A Group-based Social Navigation Framework with Large Multimodal Model
Shangyi Luo, Peng Sun, Ji Zhu, Yuhong Deng, Cunjun Yu, Anxing Xiao, Xueqian Wang

TL;DR
GSON is a social navigation framework that uses large multimodal models and visual prompts to improve robots' social perception and behavior in complex human environments.
Contribution
The paper introduces GSON, a novel group-based social navigation system leveraging LMMs and a mid-level planner to enhance social awareness and responsiveness.
Findings
Outperforms existing methods in social perturbation metrics
Maintains traditional navigation performance levels
Effectively handles complex social scenarios in real-world tests
Abstract
With the increasing presence of service robots and autonomous vehicles in human environments, navigation systems need to evolve beyond simple destination reach to incorporate social awareness. This paper introduces GSON, a novel group-based social navigation framework that leverages Large Multimodal Models (LMMs) to enhance robots' social perception capabilities. Our approach uses visual prompting to enable zero-shot extraction of social relationships among pedestrians and integrates these results with robust pedestrian detection and tracking pipelines to overcome the inherent inference speed limitations of LMMs. The planning system incorporates a mid-level planner that sits between global path planning and local motion planning, effectively preserving both global context and reactive responsiveness while avoiding disruption of the predicted social group. We validate GSON through…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and dialogue systems · Geographic Information Systems Studies
Methodstravel james · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
