Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System
Haokun Liu, Zhaoqi Ma, Yunong Li, Junichiro Sugihara, Yicheng Chen, Jinjie Li, and Moju Zhao

TL;DR
This paper introduces a hierarchical multimodal framework combining large language models and vision-language models to enable robust semantic navigation and manipulation in heterogeneous aerial-ground robotic systems, demonstrating adaptability and coordination in dynamic environments.
Contribution
The paper presents the first integrated aerial-ground robotic system that uses an LLM and a VLM for high-level reasoning, perception, and control, improving generalizability and task performance.
Findings
Successful real-world validation on long-horizon object tasks.
Zero-shot adaptability to new scenarios.
Enhanced spatial accuracy with GridMask for manipulation.
Abstract
Heterogeneous multirobot systems show great potential in complex tasks requiring coordinated hybrid cooperation. However, existing methods that rely on static or task-specific models often lack generalizability across diverse tasks and dynamic environments. This highlights the need for generalizable intelligence that can bridge high-level reasoning with low-level execution across heterogeneous agents. To address this, we propose a hierarchical multimodal framework that integrates a prompted large language model (LLM) with a fine-tuned vision-language model (VLM). At the system level, the LLM performs hierarchical task decomposition and constructs a global semantic map, while the VLM provides semantic perception and object localization, where the proposed GridMask significantly enhances the VLM's spatial accuracy for reliable fine-grained manipulation. The aerial robot leverages this…
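The abstract does not say how GridMask is constructed, so the following is only an illustrative sketch of the general idea of discretizing an image into a labeled grid so that a location can be named as a cell instead of a raw pixel. The `Grid` class, its field names, and the 8x6 layout on a 640x480 image are all assumptions made for this example and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Grid:
    """Hypothetical uniform grid laid over an image (not the paper's GridMask)."""

    width: int   # image width in pixels
    height: int  # image height in pixels
    cols: int    # number of grid columns
    rows: int    # number of grid rows

    def cell_of(self, x: float, y: float) -> tuple[int, int]:
        """Return the (row, col) of the cell containing pixel (x, y)."""
        if not (0 <= x < self.width and 0 <= y < self.height):
            raise ValueError("pixel lies outside the image")
        col = int(x * self.cols / self.width)
        row = int(y * self.rows / self.height)
        return row, col

    def center_of(self, row: int, col: int) -> tuple[float, float]:
        """Return the pixel coordinates (x, y) of a cell's center."""
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            raise ValueError("cell index lies outside the grid")
        cx = (col + 0.5) * self.width / self.cols
        cy = (row + 0.5) * self.height / self.rows
        return cx, cy


# Assumed example values: a 640x480 image split into 8 columns and 6 rows.
grid = Grid(width=640, height=480, cols=8, rows=6)
cell = grid.cell_of(330, 100)     # which cell contains this pixel
center = grid.center_of(*cell)    # pixel center of that cell
print(cell, center)
```

In a sketch like this, a perception model would only need to name a cell, and the cell center could then serve as a coarse target for a downstream controller. Whether the paper's GridMask works this way is not stated in the text above.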
