HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving

Yiru Wang; Zichong Gu; Yu Gao; Anqing Jiang; Zhigang Sun; Shuo Wang; Yuwen Heng; Hao Sun

arXiv:2602.13329 · cs.CV · February 17, 2026

Open Access

TL;DR

HiST-VLA is a hierarchical spatio-temporal vision-language-action model for end-to-end autonomous driving. It targets three weaknesses of existing VLA models (3D spatial reasoning, computational efficiency, and command grounding) and reports state-of-the-art results on NAVSIM v2: 88.6 EPDMS on Navtest and 50.9 EPDMS on Navhard.

Contribution

The paper introduces a novel hierarchical spatio-temporal VLA framework with dynamic token sparsification and a transformer-based planner for end-to-end autonomous driving.

Findings

01 Achieves 88.6 EPDMS on the Navtest benchmark.

02 Attains 50.9 EPDMS on the Navhard benchmark.

03 Demonstrates state-of-the-art performance on NAVSIM v2.

Abstract

Vision-Language-Action (VLA) models offer promising capabilities for autonomous driving through multimodal understanding. However, their utilization in safety-critical scenarios is constrained by inherent limitations, including imprecise numerical reasoning, weak 3D spatial awareness, and high sensitivity to context. To address these challenges, we propose HiST-VLA, a novel Hierarchical Spatio-Temporal VLA model designed for reliable trajectory generation. Our framework enhances 3D spatial and temporal reasoning by integrating geometric awareness with fine-grained driving commands and state history prompting. To ensure computational efficiency, we integrate dynamic token sparsification into the VLA architecture. This approach fuses redundant tokens rather than filtering them, effectively reducing redundancy without sacrificing model performance. Furthermore, we employ a hierarchical…
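The abstract contrasts fusing redundant tokens with filtering them, but this page gives no algorithm. The following is an illustrative sketch only, not the authors' method. It assumes cosine similarity as the redundancy measure and a running mean as the fusion rule. The names `fuse_tokens`, `filter_tokens`, and `threshold` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def fuse_tokens(tokens, threshold=0.9):
    """Greedy fusion (assumed rule, not from the paper).

    Each token is averaged into the most similar existing group when the
    similarity reaches `threshold`; otherwise it starts a new group.
    Returns one mean vector per group, so redundant tokens still
    contribute to the output instead of being discarded.
    """
    sums, counts = [], []
    for tok in tokens:
        best_i, best_sim = -1, threshold
        for i, (s, c) in enumerate(zip(sums, counts)):
            sim = cosine(tok, [v / c for v in s])
            if sim >= best_sim:
                best_i, best_sim = i, sim
        if best_i < 0:
            sums.append(list(tok))
            counts.append(1)
        else:
            sums[best_i] = [a + b for a, b in zip(sums[best_i], tok)]
            counts[best_i] += 1
    return [[v / c for v in s] for s, c in zip(sums, counts)]


def filter_tokens(tokens, threshold=0.9):
    """Baseline for contrast: drop any token too similar to one already kept."""
    kept = []
    for tok in tokens:
        if all(cosine(tok, k) < threshold for k in kept):
            kept.append(list(tok))
    return kept


tokens = [[1.0, 0.0], [1.0, 0.2], [0.0, 1.0]]
fused = fuse_tokens(tokens)        # second token is averaged into the first group
filtered = filter_tokens(tokens)   # second token is simply discarded
```

Both functions shorten the sequence from three tokens to two. Under fusion the first output vector carries information from the merged token, whereas filtering leaves the first token unchanged and loses the second entirely.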

Taxonomy

Topics: Autonomous Vehicle Technology and Safety · Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning