Constrained Robotic Navigation on Preferred Terrains Using LLMs and Speech Instruction: Exploiting the Power of Adverbs
Faraz Lotfi, Farnoosh Faraji, Nikhil Kakodkar, Travis Manderson, David Meger, and Gregory Dudek

TL;DR
This paper presents a novel approach to off-road robotic navigation that uses large language models and speech instructions to interpret high-level commands, identify landmarks and terrains, and control vehicle movement without relying on extensive data collection.
Contribution
The paper introduces a method that combines LLMs, speech-to-text, semantic segmentation, and model predictive control (MPC) for map-free navigation on preferred terrains, driven by verbal instructions and the adverbs they contain.
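To make the role of adverbs concrete, here is a minimal sketch of turning an adverb in a transcribed instruction into a speed setting. It is not the paper's method: the paper uses an LLM for this extraction, whereas this sketch swaps in a plain keyword lookup. The table `ADVERB_SPEED_MPS`, its values, the default speed, and the function name `speed_from_instruction` are all hypothetical.

```python
# Hypothetical adverb-to-speed table (metres per second).
# The adverbs and values are illustrative assumptions, not taken from the paper.
ADVERB_SPEED_MPS = {
    "slowly": 0.5,
    "carefully": 0.5,
    "normally": 1.0,
    "quickly": 2.0,
    "fast": 2.0,
}
DEFAULT_SPEED_MPS = 1.0  # assumed fallback when no known adverb is present


def speed_from_instruction(text: str) -> float:
    """Return the speed for the first known adverb in the instruction.

    A keyword lookup standing in for the LLM-based extraction the paper describes.
    """
    for raw_word in text.split():
        word = raw_word.strip(".,!?;:").lower()
        if word in ADVERB_SPEED_MPS:
            return ADVERB_SPEED_MPS[word]
    return DEFAULT_SPEED_MPS


if __name__ == "__main__":
    print(speed_from_instruction("Go slowly to the tree, staying on the grass."))
```

A lookup table like this only handles adverbs listed in advance; an LLM can in principle handle paraphrases and unseen wording, which is presumably why the paper relies on one.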
Findings
Effective interpretation of verbal instructions for navigation.
Successful identification of landmarks and terrains from images.
Enhanced adaptability to diverse off-road environments.
Abstract
This paper explores leveraging large language models for map-free off-road navigation using generative AI, reducing the need for traditional data collection and annotation. We propose a method in which a robot receives verbal instructions that are converted to text by Whisper, and a large language model (LLM) extracts landmarks, preferred terrains, and crucial adverbs, which are translated into speed settings for constrained navigation. A language-driven semantic segmentation model generates text-based masks for identifying landmarks and terrain types in images. By translating 2D image points to the vehicle's motion plane using camera parameters, an MPC controller can guide the vehicle towards the desired terrain. This approach enhances adaptation to diverse environments and facilitates the use of high-level instructions for navigating complex and challenging terrains.
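The step of translating 2D image points to the vehicle's motion plane can be illustrated with a standard pinhole-camera ground-plane projection. The sketch below is an assumption about how such a mapping could look, not the authors' implementation: it assumes a flat ground plane, a camera mounted at a known height and pitched downward by a known angle, and no lens distortion. The function name `pixel_to_ground` and all parameter names are hypothetical.

```python
import math
from typing import Optional, Tuple


def pixel_to_ground(
    u: float, v: float,
    fx: float, fy: float, cx: float, cy: float,
    cam_height: float, pitch_down: float = 0.0,
) -> Optional[Tuple[float, float]]:
    """Project pixel (u, v) onto a flat ground plane.

    Assumptions (not from the paper): pinhole camera, flat ground, camera
    at `cam_height` metres above the ground, tilted down by `pitch_down`
    radians. Returns (forward, left) in metres in the vehicle frame, or
    None if the pixel's ray does not hit the ground (at or above horizon).
    """
    # Normalised image coordinates: x to the right, y downward.
    x_n = (u - cx) / fx
    y_n = (v - cy) / fy

    s, c = math.sin(pitch_down), math.cos(pitch_down)

    # Ray direction in the vehicle frame (X forward, Y left, Z up).
    dir_forward = c - y_n * s
    dir_left = -x_n
    dir_up = -(y_n * c + s)

    if dir_up >= 0.0:
        return None  # ray points at or above the horizon

    # Scale the ray so that it descends exactly cam_height.
    t = cam_height / -dir_up
    return (t * dir_forward, t * dir_left)


if __name__ == "__main__":
    # Level camera 1 m above the ground, 640x480 image, 500 px focal length.
    print(pixel_to_ground(320, 490, 500, 500, 320, 240, cam_height=1.0))
```

Ground points computed this way from a terrain mask could then serve as targets for a controller such as the MPC mentioned above. The flat-ground assumption is a simplification that may not hold on uneven off-road terrain.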
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
