LocoVLM: Grounding Vision and Language for Adapting Versatile Legged Locomotion Policies
I Made Aswin Nahrendra, Seunghyun Lee, Dongkyu Lee, Hyun Myung

TL;DR
This paper introduces LocoVLM, a novel framework that integrates language and vision models to enable real-time, instruction-guided adaptation of legged robot locomotion, enhancing responsiveness to high-level semantic cues.
Contribution
LocoVLM combines foundation models with learned locomotion policies so that a legged robot can interpret semantic cues and adapt its gait in real time, without depending on cloud services.
Findings
Achieves up to 87% instruction-following accuracy
Enables real-time semantic-grounded locomotion adaptation
First to demonstrate high-level reasoning for legged robot control
Abstract
Recent advances in legged locomotion learning are still dominated by the utilization of geometric representations of the environment, limiting the robot's capability to respond to higher-level semantics such as human instructions. To address this limitation, we propose a novel approach that integrates high-level commonsense reasoning from foundation models into the process of legged locomotion adaptation. Specifically, our method utilizes a pre-trained large language model to synthesize an instruction-grounded skill database tailored for legged robots. A pre-trained vision-language model is employed to extract high-level environmental semantics and ground them within the skill database, enabling real-time skill advisories for the robot. To facilitate versatile skill control, we train a style-conditioned policy capable of generating diverse and robust locomotion skills with high fidelity…
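The retrieval step described in the abstract, where an instruction or a scene description is grounded in a skill database, can be illustrated with a small sketch. This is not the authors' implementation: the `Skill` fields, the three database entries, the numeric values, and the bag-of-words cosine similarity are all assumptions made for illustration. The abstract does not specify how entries are represented or matched, and a real system would presumably use the foundation models' own embeddings.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """Hypothetical database entry: an instruction plus illustrative style parameters."""
    instruction: str
    gait: str
    speed_scale: float  # made-up multiplier on a nominal velocity command


# Toy entries written by hand; in the paper the database is synthesized by an LLM.
SKILL_DB = [
    Skill("walk slowly and carefully on slippery floor", "walk", 0.4),
    Skill("trot quickly across open ground", "trot", 1.0),
    Skill("crouch low to pass under an obstacle", "crawl", 0.5),
]


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count alphabetic tokens (punctuation is dropped)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors; 0.0 if either is empty."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_skill(query: str, db=SKILL_DB) -> Skill:
    """Return the entry whose instruction is most similar to the query text."""
    q = bag_of_words(query)
    return max(db, key=lambda skill: cosine(q, bag_of_words(skill.instruction)))


if __name__ == "__main__":
    # The query could come from a human instruction or a VLM scene description.
    advised = retrieve_skill("the floor looks slippery, be careful")
    print(advised.gait, advised.speed_scale)
```

In this sketch the retrieved entry plays the role of a skill advisory: its parameters would be passed as the conditioning input to a style-conditioned policy. Word overlap is used only to keep the example self-contained and dependency-free.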
Taxonomy
Topics: Robotic Locomotion and Control · Social Robot Interaction and HRI · Robot Manipulation and Learning
