MOSU: Autonomous Long-range Robot Navigation with Multi-modal Scene Understanding

Jing Liang; Kasun Weerakoon; Daeun Song; Senthurbavan Kirubaharan; Xuesu Xiao; and Dinesh Manocha

arXiv:2507.04686 · cs.RO · July 8, 2025

TL;DR

MOSU is a comprehensive autonomous navigation system that integrates multimodal perception, including geometric, semantic, and social understanding, to improve outdoor robot navigation over long distances.

Contribution

MOSU introduces a multi-modal perception framework that combines LiDAR, images, and VLMs to improve scene understanding and navigation in outdoor environments.

Findings

1. 10% improvement in traversability on navigable terrains
2. Maintains comparable navigation distance to existing methods
3. Effective integration of multimodal data for complex scene understanding

Abstract

We present MOSU, a novel autonomous long-range navigation system that enhances global navigation for mobile robots through multimodal perception and on-road scene understanding. MOSU addresses the outdoor robot navigation challenge by integrating geometric, semantic, and contextual information to ensure comprehensive scene understanding. The system combines GPS and QGIS map-based routing for high-level global path planning with multi-modal trajectory generation for local navigation refinement. For trajectory generation, MOSU leverages multiple modalities: LiDAR-based geometric data for precise obstacle avoidance, image-based semantic segmentation for traversability assessment, and Vision-Language Models (VLMs) to capture social context and enable the robot to adhere to social norms in complex environments. This multi-modal integration improves scene understanding and enhances…
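
As an illustration of how geometric, semantic, and social signals could be combined when choosing among candidate local trajectories, here is a minimal sketch. It is not the authors' method: the weighted-sum scoring, the `TrajectoryScores` fields, the weights, and the `[0, 1]` cost ranges are all assumptions made for this example, and the abstract does not specify how MOSU fuses the three modalities.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class TrajectoryScores:
    """Hypothetical per-trajectory costs, each assumed to lie in [0, 1]."""
    geometric_cost: float  # assumed: obstacle proximity derived from LiDAR
    semantic_cost: float   # assumed: non-traversable terrain from segmentation
    social_cost: float     # assumed: social-norm violation judged by a VLM


def combined_cost(scores: TrajectoryScores,
                  weights: Tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Weighted sum of the three costs (weights are illustrative only)."""
    w_geo, w_sem, w_soc = weights
    return (w_geo * scores.geometric_cost
            + w_sem * scores.semantic_cost
            + w_soc * scores.social_cost)


def select_trajectory(candidates: Sequence[TrajectoryScores],
                      weights: Tuple[float, float, float] = (0.5, 0.3, 0.2)) -> int:
    """Return the index of the candidate with the lowest combined cost."""
    if not candidates:
        raise ValueError("at least one candidate trajectory is required")
    return min(range(len(candidates)),
               key=lambda i: combined_cost(candidates[i], weights))


# Three made-up candidates: near an obstacle, clear path, socially intrusive.
candidates = [
    TrajectoryScores(geometric_cost=0.9, semantic_cost=0.1, social_cost=0.1),
    TrajectoryScores(geometric_cost=0.1, semantic_cost=0.2, social_cost=0.1),
    TrajectoryScores(geometric_cost=0.2, semantic_cost=0.9, social_cost=0.9),
]
best_index = select_trajectory(candidates)
```

With these made-up numbers the second candidate is selected, because it is low on all three costs. A real system would more likely treat the geometric term as a hard constraint (rejecting colliding trajectories outright) than trade it off linearly against the other terms.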
