GSON: A Group-based Social Navigation Framework with Large Multimodal Model

Shangyi Luo; Peng Sun; Ji Zhu; Yuhong Deng; Cunjun Yu; Anxing Xiao; Xueqian Wang

arXiv:2409.18084 · cs.RO · July 30, 2025

TL;DR

GSON is a social navigation framework that uses large multimodal models and visual prompts to improve robots' social perception and behavior in complex human environments.

Contribution

The paper introduces GSON, a group-based social navigation system that combines large multimodal models (LMMs) with a mid-level planner to improve social awareness and responsiveness.

Findings

01 · Outperforms existing methods in social perturbation metrics

02 · Maintains traditional navigation performance levels

03 · Effectively handles complex social scenarios in real-world tests

Abstract

With the increasing presence of service robots and autonomous vehicles in human environments, navigation systems need to evolve beyond simple destination reach to incorporate social awareness. This paper introduces GSON, a novel group-based social navigation framework that leverages Large Multimodal Models (LMMs) to enhance robots' social perception capabilities. Our approach uses visual prompting to enable zero-shot extraction of social relationships among pedestrians and integrates these results with robust pedestrian detection and tracking pipelines to overcome the inherent inference speed limitations of LMMs. The planning system incorporates a mid-level planner that sits between global path planning and local motion planning, effectively preserving both global context and reactive responsiveness while avoiding disruption of the predicted social group. We validate GSON through…
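
To make the idea of "avoiding disruption of the predicted social group" concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that group labels have already been produced upstream (for example by an LMM combined with a tracker), that each group's occupied space can be approximated by the convex hull of its members' 2D positions, and that a path is a list of waypoints. All names here (`convex_hull`, `path_disrupts_group`, `disrupted_groups`) are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """2D cross product of vectors (a - o) and (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def segments_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """Proper crossing test; touching and collinear cases count as no crossing."""
    d1, d2 = cross(q1, q2, p1), cross(q1, q2, p2)
    d3, d4 = cross(p1, p2, q1), cross(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def point_in_hull(p: Point, hull: List[Point]) -> bool:
    """True if p lies inside or on a counter-clockwise hull with >= 3 vertices."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def path_disrupts_group(path: List[Point], members: List[Point]) -> bool:
    """Assumption: a path disrupts a group if it cuts through the members' hull."""
    hull = convex_hull(members)
    if len(hull) < 2:
        return False  # a single pedestrian is left to ordinary obstacle avoidance
    if len(hull) == 2:
        edges = [(hull[0], hull[1])]
    else:
        edges = list(zip(hull, hull[1:] + hull[:1]))
    for a, b in zip(path, path[1:]):
        if any(segments_cross(a, b, e1, e2) for e1, e2 in edges):
            return True
    return len(hull) >= 3 and any(point_in_hull(p, hull) for p in path)


def disrupted_groups(path: List[Point], groups: Dict[str, List[Point]]) -> List[str]:
    """Names of all groups whose hull the path cuts through."""
    return sorted(g for g, members in groups.items() if path_disrupts_group(path, members))


# Hypothetical scene: group labels are assumed to come from the perception stage.
groups = {
    "g1": [(2.0, 1.0), (2.0, -1.0)],              # two people facing each other
    "g2": [(5.0, 3.0), (6.0, 3.0), (5.5, 4.0)],   # three people off to the side
}
global_path = [(0.0, 0.0), (4.0, 0.0), (8.0, 0.0)]
print(disrupted_groups(global_path, groups))
```

In a sketch like this, a mid-level planner could use the returned group names to decide where the global path needs a local detour before handing waypoints to the motion planner; how GSON actually represents groups and replans is described in the paper itself, not here.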

Taxonomy

Topics: Speech and dialogue systems · Geographic Information Systems Studies

Methods: travel james · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings