LiteVLA-Edge: Quantized On-Device Multimodal Control for Embedded Robotics
Justin Williams, Kishor Datta Gupta, Roy George, Mrinmoy Sarkar

TL;DR
LiteVLA-Edge is a practical system enabling real-time, fully on-device vision-language-action processing for embedded robots, combining quantization and GPU acceleration to achieve low latency.
Contribution
It introduces a deployment-oriented pipeline for running compact multimodal control models locally on embedded hardware with preserved modular interfaces.
Findings
Achieves a mean end-to-end latency of 150.5 ms (approximately 6.6 Hz) on Jetson Orin-class hardware.
Operates entirely offline within a ROS 2 pipeline.
Provides a reproducible baseline for on-device VLA in robotics.
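
The 6.6 Hz figure follows directly from the reported latency if one assumes each control step waits for the previous inference to finish (no pipelining or batching). A minimal sketch of that arithmetic; the function name `control_rate_hz` is illustrative and does not come from the paper:

```python
def control_rate_hz(latency_ms: float) -> float:
    """Return the maximum sequential control rate for a given per-step latency.

    Assumes one inference per control step with no overlap between steps.
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return 1000.0 / latency_ms


# Reported mean end-to-end latency from the summary above.
rate = control_rate_hz(150.5)
print(f"{rate:.1f} Hz")  # 6.6 Hz
```

Because the reported latency is a mean, the achievable rate on any individual step may be higher or lower than this value.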
Abstract
Vision-Language-Action (VLA) models provide a unified framework for perception, language conditioning, and action generation, but many existing systems remain difficult to deploy in embedded robotic settings because of their computational requirements and inference latency. In this paper, we present LiteVLA-Edge, a deployment-oriented VLA pipeline for fully on-device inference on Jetson Orin-class hardware. Our approach combines supervised image-to-action fine-tuning in FP32 with post-training 4-bit GGUF quantization and GPU-accelerated inference through the llama.cpp runtime. Under our deployment configuration, LiteVLA-Edge achieves a mean end-to-end latency of 150.5 ms (approximately 6.6 Hz) while operating entirely offline within a ROS 2-integrated perception-reasoning-action pipeline. Rather than introducing a new policy objective, our contribution is a practical…
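
To give a rough sense of what post-training 4-bit quantization does to a block of weights, the sketch below rounds each value to one of 16 integer levels and stores a single floating-point scale per block. This is a simplified symmetric scheme written for illustration only: it is not the GGUF on-disk format, nor the specific quantization type used by the authors, and the function names and sample weights are hypothetical.

```python
def quantize_block_4bit(values):
    """Symmetric 4-bit quantization of one block.

    Returns integers in [-8, 7] plus one float scale for the whole block.
    Illustrative only; real GGUF quantization types use different layouts.
    """
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to +/-7; fall back to 1.0 for an all-zero block.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale


def dequantize_block(q, scale):
    """Recover approximate float values from 4-bit integers and the block scale."""
    return [qi * scale for qi in q]


# Hypothetical FP32 weights standing in for one block of a fine-tuned model.
weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.49, -0.02, 0.00]
q, scale = quantize_block_4bit(weights)
restored = dequantize_block(q, scale)

# Rounding to the nearest level bounds the per-weight error by half a step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

Storing 4-bit integers plus one scale per block is what reduces memory footprint and bandwidth relative to FP32, at the cost of the bounded rounding error computed above.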
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robot Manipulation and Learning
