LiteVLA-H: Dual-Rate Vision-Language-Action Inference for Onboard Aerial Guidance and Semantic Perception

Justin Williams; Kishor Datta Gupta; Roy George; Mrinmoy Sarkar

arXiv:2605.00884·cs.CV·May 12, 2026

TL;DR

LiteVLA-H is a compact dual-rate vision-language-action model optimized for onboard aerial guidance, achieving low-latency reactive control and semantic perception on embedded platforms.

Contribution

The paper introduces LiteVLA-H, a novel dual-rate VLA system with a scheduler for real-time aerial guidance and semantic understanding on edge devices.

Findings

01

Reactive action tokens issued at 19.74 Hz on embedded hardware.

02

Semantic outputs maintained at 6.08–6.67 Hz with low latency.

03

Outperforms recent state-of-the-art architectures in edge inference rate.
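The rates in the findings are easier to compare as per-output periods. The short sketch below converts them, using only the figures quoted on this page; the helper name `period_ms` is an illustrative assumption and does not come from the paper.

```python
def period_ms(rate_hz: float) -> float:
    """Convert an update rate in Hz to a per-output period in milliseconds."""
    if rate_hz <= 0:
        raise ValueError("rate must be positive")
    return 1000.0 / rate_hz


# Rates quoted in the findings above.
action_period = period_ms(19.74)   # reactive action tokens
semantic_slow = period_ms(6.08)    # lower end of the semantic range
semantic_fast = period_ms(6.67)    # upper end of the semantic range

print(f"action:   {action_period:.2f} ms per output")
print(f"semantic: {semantic_fast:.2f} to {semantic_slow:.2f} ms per output")
```

Up to rounding, the action period matches the 50.65 ms reported in the abstract, and a semantic output takes roughly three times as long as an action output.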

Abstract

Vision-language-action (VLA) models have shown strong semantic grounding and task generalization in manipulation, but aerial deployment remains difficult because drones require low-latency closed-loop guidance under strict onboard compute and communication constraints. We present LiteVLA-H, a compact 256M-parameter VLA system designed for dual-rate operation on an NVIDIA Jetson AGX Orin: a fast outer-loop guidance mode for short action-token outputs and a slower semantic mode for scene understanding, hazard description, and operator-facing narration. The central empirical observation is that, in this compact edge regime, end-to-end latency is dominated by multimodal pre-fill rather than by the marginal cost of decoding a few extra tokens. This motivates a scheduler that issues reactive action tokens at 50.65 ms (19.74 Hz) while still supporting sentence-level semantic outputs at…
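The pre-fill observation in the abstract can be illustrated with a toy latency model. This is a minimal sketch, not the authors' scheduler: the linear cost model, the class and function names, the placeholder costs (45.0 ms pre-fill, 1.5 ms per token), and the fixed interleaving pattern are all assumptions chosen for illustration and are not measurements from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LatencyModel:
    """Toy model: latency = fixed pre-fill cost + per-token decode cost."""
    prefill_ms: float    # hypothetical multimodal pre-fill cost
    per_token_ms: float  # hypothetical marginal cost of one decoded token

    def latency_ms(self, n_tokens: int) -> float:
        return self.prefill_ms + n_tokens * self.per_token_ms


def schedule(n_ticks: int, semantic_every: int) -> List[str]:
    """Label each tick; every `semantic_every`-th tick runs the semantic mode."""
    return [
        "semantic" if (i + 1) % semantic_every == 0 else "action"
        for i in range(n_ticks)
    ]


# Placeholder costs, NOT taken from the paper.
model = LatencyModel(prefill_ms=45.0, per_token_ms=1.5)
action_latency = model.latency_ms(4)     # short action-token output
semantic_latency = model.latency_ms(40)  # sentence-level output

print(f"pre-fill share of an action step: {model.prefill_ms / action_latency:.0%}")
print(f"semantic / action latency ratio:  {semantic_latency / action_latency:.2f}")
print(schedule(6, semantic_every=3))
```

Under these placeholder numbers, a tenfold increase in decoded tokens only about doubles the latency, which is the kind of regime in which interleaving occasional longer semantic outputs with short action outputs becomes plausible.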
