BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation
Hongyu Wang, Chuyan Xiong, Ruiping Wang, Xilin Chen

TL;DR
BitVLA introduces a fully native 1-bit vision-language-action model for robotics, significantly reducing memory and latency while maintaining strong task performance, enabling efficient deployment on edge devices.
Contribution
The paper presents BitVLA, a 1-bit VLA model built on a 1-bit LLM, together with Quantize-then-Distill, a new strategy for compressing the vision encoder. The model reaches high efficiency while matching full-precision baseline performance.
Findings
Matches full-precision baseline performance
Reduces model memory by 11.0x
Lowers end-to-end latency by 4.4x
Abstract
Deploying powerful Vision-Language-Action (VLA) models on edge devices is limited by their massive size. In this paper, we take a deployment-oriented view of VLA training: we target efficiency through model design and optimization, rather than relying solely on post-hoc compression. Thus, we propose BitVLA, a fully native 1-bit VLA model for robotic manipulation, where every parameter is ternary, i.e., {-1, 0, 1}. BitVLA is built on the publicly available 1-bit LLM BitNet b1.58 2B4T, and is trained as a vision-language-action policy that inherits the compactness of 1-bit pretraining while retaining strong task performance. To further reduce the memory footprint of the vision backbone, we introduce Quantize-then-Distill, a post-training quantization-aware training strategy that compresses a full-precision vision encoder to 1.58-bit weights, while a full-precision teacher guides…
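To make "ternary" and "1.58-bit" concrete, here is a minimal sketch of absmean ternarization, the weight quantizer described for BitNet b1.58. It assumes a single per-tensor scale and a flat list of weights. The function names (`absmean_ternarize`, `dequantize`) and the sample values are hypothetical and are not taken from the BitVLA code. Whether BitVLA applies exactly this quantizer to its vision encoder is also an assumption, and the distillation step is not shown.

```python
import math

def absmean_ternarize(weights, eps=1e-8):
    """Map a flat list of floats to values in {-1, 0, 1} plus one scale.

    Illustrative only: per-tensor absmean scaling, then round and clip.
    """
    # Scale is the mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights)
    # Divide by the scale, round to the nearest integer, clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return ternary, scale

def dequantize(ternary, scale):
    """Approximate the original weights from ternary codes and the scale."""
    return [t * scale for t in ternary]

# Hypothetical sample weights, chosen only for illustration.
weights = [0.9, -0.05, 0.4, -1.3, 0.02, -0.6]
ternary, scale = absmean_ternarize(weights)
approx = dequantize(ternary, scale)

# Three possible values carry log2(3) bits of information per weight,
# which is where the "1.58-bit" label comes from.
bits_per_weight = math.log2(3)

print(ternary)
print(round(bits_per_weight, 2))
```

Each weight collapses to one of three codes, so the stored model needs only the codes and one scale per tensor. Weights that are small relative to the scale become 0, and all others keep only their sign.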
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: ALIGN
