LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks

Yi Yang; Jiaxuan Sun; Siqi Kou; Yihan Wang; Zhijie Deng

arXiv:2506.00411·cs.RO·June 3, 2025

TL;DR

LoHoVLA is a unified vision-language-action framework that leverages a pretrained vision language model and hierarchical control to improve long-horizon embodied task performance in simulation.

Contribution

The paper introduces LoHoVLA, a unified vision-language-action model for long-horizon tasks, together with a hierarchical control mechanism and a new dataset, LoHoSet.

Findings

01. Outperforms existing hierarchical and standard VLA models on long-horizon tasks

02. Demonstrates better generalization across diverse tasks

03. Shows significant improvements in the Ravens simulator

Abstract

Real-world embodied agents face long-horizon tasks, characterized by high-level goals demanding multi-step solutions beyond single actions. Successfully navigating these requires both high-level task planning (i.e., decomposing goals into sub-tasks) and low-level motion control (i.e., generating precise robot actions). While existing vision language action (VLA) models and hierarchical architectures offer potential in embodied tasks, the former often falter in planning, and the latter can suffer from coordination issues, both hampering performance. We introduce a new unified VLA framework for long-horizon tasks, dubbed LoHoVLA, to overcome these limitations. LoHoVLA leverages a large pretrained vision language model (VLM) as the backbone to jointly generate language and action tokens for sub-task generation and robot action prediction, respectively. This shared representation promotes…
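
The division of labor described in the abstract, one model that emits a language sub-task and then a robot action, can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: `ToyEnv`, `ToyUnifiedPolicy`, `plan_subtask`, `predict_action`, and `run_episode` are hypothetical names, the "policy" is a pair of hand-written rules rather than a pretrained VLM, and re-planning from the latest observation at every step is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a simulator: each block must end up in the bowl."""
    pending: list = field(default_factory=lambda: ["red", "green", "blue"])
    placed: list = field(default_factory=list)

    def observe(self):
        # Return a copy of the state so the policy cannot mutate the environment.
        return {"pending": list(self.pending), "placed": list(self.placed)}

    def step(self, action):
        # An action is a (verb, block) pair; only a valid "place" changes state.
        verb, block = action
        if verb == "place" and block in self.pending:
            self.pending.remove(block)
            self.placed.append(block)


class ToyUnifiedPolicy:
    """One object plays both roles, mimicking a shared backbone (toy rules, no learning)."""

    def plan_subtask(self, goal, obs):
        # High-level step: emit the next sub-task as text, or None when nothing is left.
        if not obs["pending"]:
            return None
        return f"place the {obs['pending'][0]} block in the bowl"

    def predict_action(self, subtask, obs):
        # Low-level step: turn the sub-task text into a concrete action.
        block = subtask.split()[2]  # "place the <color> block ..."
        return ("place", block)


def run_episode(env, policy, goal, max_steps=10):
    """Alternate planning and acting until the planner reports completion."""
    trace = []
    for _ in range(max_steps):
        obs = env.observe()
        subtask = policy.plan_subtask(goal, obs)
        if subtask is None:
            break
        action = policy.predict_action(subtask, obs)
        env.step(action)
        trace.append((subtask, action))
    return trace


if __name__ == "__main__":
    env = ToyEnv()
    for subtask, action in run_episode(env, ToyUnifiedPolicy(), "put all blocks in the bowl"):
        print(subtask, "->", action)
```

Keeping both steps on one object mirrors the unified design the abstract argues for; a hierarchical baseline would instead place `plan_subtask` and `predict_action` in separate models that must be coordinated.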

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Robotics and Automated Systems