Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

Lucy Xiaoyang Shi; Brian Ichter; Michael Equi; Liyiming Ke; Karl Pertsch; Quan Vuong; James Tanner; Anna Walling; Haohuan Wang; Niccolo Fusai; Adrian Li-Bell; Danny Driess; Lachy Groom; Sergey Levine; Chelsea Finn

arXiv:2502.19417·cs.RO·July 16, 2025

Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, Adrian Li-Bell, Danny Driess, Lachy Groom, Sergey Levine, Chelsea Finn

PDF

Open Access 1 Video

TL;DR

This paper introduces a hierarchical vision-language-action system for generalist robots capable of understanding complex instructions and feedback, enabling them to perform diverse tasks across different robotic platforms.

Contribution

The work presents a novel hierarchical model that integrates reasoning over complex prompts and feedback with low-level action execution for versatile robot task performance.

Findings

01

Successfully handled complex instructions and feedback in real-world tasks

02

Demonstrated across multiple robotic platforms including single-arm and dual-arm robots

03

Achieved effective task completion in scenarios like cleaning, cooking, and shopping

Abstract

Generalist robots that can perform a range of different tasks in open-world settings must be able to not only reason about the steps needed to accomplish their goals, but also process complex instructions, prompts, and even feedback during task execution. Intricate instructions (e.g., "Could you make me a vegetarian sandwich?" or "I don't like that one") require not just the ability to physically perform the individual steps, but the ability to situate complex commands and feedback in the physical world. In this work, we describe a system that uses vision-language models in a hierarchical structure, first reasoning over complex prompts and user feedback to deduce the most appropriate next step to fulfill the task, and then performing that step with low-level actions. In contrast to direct instruction following methods that can fulfill simple commands ("pick up the cup"), our system can…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models· slideslive

Taxonomy

TopicsMultimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI