LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks
Yi Yang, Jiaxuan Sun, Siqi Kou, Yihan Wang, Zhijie Deng

TL;DR
LoHoVLA is a unified vision-language-action framework that leverages a pretrained vision language model and hierarchical control to improve long-horizon embodied task performance in simulation.
Contribution
The paper introduces LoHoVLA, a unified model that combines vision, language, and action for long-horizon tasks. It pairs the model with a hierarchical control mechanism and a new dataset, LoHoSet.
Findings
Outperforms existing hierarchical and standard VLA models on long-horizon tasks
Demonstrates better generalization across diverse tasks
Shows significant improvements in the Ravens simulator
Abstract
Real-world embodied agents face long-horizon tasks, characterized by high-level goals demanding multi-step solutions beyond single actions. Successfully navigating these requires both high-level task planning (i.e., decomposing goals into sub-tasks) and low-level motion control (i.e., generating precise robot actions). While existing vision language action (VLA) models and hierarchical architectures offer potential in embodied tasks, the former often falter in planning, and the latter can suffer from coordination issues, both hampering performance. We introduce a new unified VLA framework for long-horizon tasks, dubbed LoHoVLA, to overcome these limitations. LoHoVLA leverages a large pretrained vision language model (VLM) as the backbone to jointly generate language and action tokens for sub-task generation and robot action prediction, respectively. This shared representation promotes…
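The two-level structure described in the abstract (a sub-task expressed in language, then a robot action conditioned on it, both produced by one model) can be illustrated with a toy control loop. This is only a sketch under stated assumptions, not the authors' implementation: `ToyUnifiedPolicy`, `run_episode`, the block-and-bowl task, the `"done"` stop signal, and the choice to re-plan the sub-task at every step are all hypothetical, and plain Python functions stand in for the VLM backbone.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a real system would use camera images and robot poses.
Observation = Dict[str, List[str]]  # names of blocks still to be placed
Action = Tuple[float, float]        # 2D pick location


@dataclass
class ToyUnifiedPolicy:
    """Stand-in for one shared model with two output modes (assumption)."""
    plan_fn: Callable[[str, Observation], str]    # goal + obs -> sub-task text
    act_fn: Callable[[str, Observation], Action]  # sub-task + obs -> action


def run_episode(policy, goal, obs, step_env, max_steps=10):
    """Alternate high-level planning and low-level action prediction."""
    trace = []
    for _ in range(max_steps):
        # High level: decompose the goal into the next sub-task.
        subtask = policy.plan_fn(goal, obs)
        if subtask == "done":  # hypothetical stop signal
            break
        # Low level: predict an action conditioned on the sub-task.
        action = policy.act_fn(subtask, obs)
        obs = step_env(obs, action)
        trace.append((subtask, action))
    return trace


# Toy task: move every block into a bowl. Coordinates are made up.
BLOCK_XY = {"red": (0.1, 0.2), "blue": (0.4, 0.3)}


def toy_plan(goal, obs):
    remaining = obs["remaining"]
    return f"put the {remaining[0]} block in the bowl" if remaining else "done"


def toy_act(subtask, obs):
    color = subtask.split()[2]  # third word of the sub-task is the color
    return BLOCK_XY[color]


def toy_step(obs, action):
    # Pretend the pick always succeeds and removes the first block.
    return {"remaining": obs["remaining"][1:]}


policy = ToyUnifiedPolicy(plan_fn=toy_plan, act_fn=toy_act)
trace = run_episode(
    policy,
    "put all blocks in the bowl",
    {"remaining": ["red", "blue"]},
    toy_step,
)
for subtask, action in trace:
    print(subtask, "->", action)
```

Because the sub-task is recomputed from the latest observation on every iteration, a failed step in a real environment would simply cause the same sub-task to be proposed again. That re-planning behavior is a design choice of this sketch and is not stated in the text above.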
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Robotics and Automated Systems
