# Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation

**Authors:** Yiguo Fan, Pengxiang Ding, Shuanghao Bai, Xinyang Tong, Yuyang Zhu, Hongchao Lu, Fengqi Dai, Wei Zhao, Yang Liu, Siteng Huang, Zhaoxin Fan, Badong Chen, Donglin Wang

arXiv: 2508.19958 · 2025-08-29

## TL;DR

Long-VLA is a vision-language-action model for long-horizon robotic manipulation. Its phase-aware input masking splits each subtask into moving and interaction phases so the policy attends to phase-relevant sensory cues, which the authors report improves skill chaining across subtasks.

## Contribution

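The paper presents what the authors describe as the first end-to-end VLA model designed specifically for long-horizon tasks, an architecture-agnostic phase-aware input masking strategy that can be integrated into existing VLA models, and the L-CALVIN benchmark for evaluating long-horizon manipulation.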

## Key findings

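- Long-VLA significantly outperforms prior state-of-the-art methods on simulated long-horizon tasks, according to the authors.
- Phase-aware input masking segments each subtask into moving and interaction phases, letting the model focus on phase-relevant sensory cues and improving compatibility between subtasks.
- The authors report the same advantage over prior methods on real-world robotic manipulation tasks.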

## Abstract

Vision-Language-Action (VLA) models have become a cornerstone in robotic policy learning, leveraging large-scale multimodal data for robust and scalable control. However, existing VLA frameworks primarily address short-horizon tasks, and their effectiveness on long-horizon, multi-step robotic manipulation remains limited due to challenges in skill chaining and subtask dependencies. In this work, we introduce Long-VLA, the first end-to-end VLA model specifically designed for long-horizon robotic tasks. Our approach features a novel phase-aware input masking strategy that adaptively segments each subtask into moving and interaction phases, enabling the model to focus on phase-relevant sensory cues and enhancing subtask compatibility. This unified strategy preserves the scalability and data efficiency of VLA training, and our architecture-agnostic module can be seamlessly integrated into existing VLA models. We further propose the L-CALVIN benchmark to systematically evaluate long-horizon manipulation. Extensive experiments on both simulated and real-world tasks demonstrate that Long-VLA significantly outperforms prior state-of-the-art methods, establishing a new baseline for long-horizon robotic control.
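To make the idea of phase-aware input masking concrete, here is a minimal sketch, not the authors' implementation. It assumes each input token is tagged with the stream it came from and that each phase has a fixed set of visible streams. The stream names (`static_camera`, `gripper_camera`), the phase-to-stream mapping, and the helper functions are all hypothetical; the abstract does not specify which sensory cues belong to which phase.

```python
from typing import Dict, List, Set

# Hypothetical mapping from phase to the input streams the policy may use.
# The abstract only says the model focuses on "phase-relevant sensory cues";
# the specific assignment below is an assumption for illustration.
PHASE_VISIBLE_STREAMS: Dict[str, Set[str]] = {
    "moving": {"language", "static_camera"},
    "interaction": {"language", "gripper_camera"},
}


def build_input_mask(token_streams: List[str], phase: str) -> List[bool]:
    """Return True for each token whose stream is visible in the given phase."""
    visible = PHASE_VISIBLE_STREAMS[phase]
    return [stream in visible for stream in token_streams]


def apply_mask(tokens: List[float], mask: List[bool]) -> List[float]:
    """Zero out hidden tokens (a simple stand-in for masking them in attention)."""
    return [value if keep else 0.0 for value, keep in zip(tokens, mask)]


# Toy input: one language token, two static-camera tokens, one gripper-camera token.
token_streams = ["language", "static_camera", "static_camera", "gripper_camera"]
tokens = [0.5, 1.0, 2.0, 3.0]

moving_mask = build_input_mask(token_streams, "moving")
interaction_mask = build_input_mask(token_streams, "interaction")
masked_for_interaction = apply_mask(tokens, interaction_mask)
```

Because the mask only changes which inputs are visible, a scheme like this leaves the underlying policy network unchanged, which is consistent with the abstract's description of the module as architecture-agnostic.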

---
Source: https://tomesphere.com/paper/2508.19958