TL;DR
QuantVLA is a training-free post-training quantization (PTQ) framework for vision-language-action (VLA) models. It reduces memory and compute demands, with approximately 70% relative memory savings, while exceeding the full-precision baseline's success rates on LIBERO tasks.
Contribution
To the authors' knowledge, it is the first PTQ method for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head, a step toward scalable low-bit embodied intelligence.
Findings
Exceeds full-precision baseline success rates on LIBERO tasks.
Achieves approximately 70% relative memory savings.
Supports low-bit integer kernels without architecture changes.
Abstract
Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes…
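The first two components named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The layer names (`q_proj`, `k_proj`, `v_proj`, `o_proj`), the 8-bit symmetric per-channel quantizer, and the rule of matching the standard deviation of attention logits per head are all assumptions chosen for illustration. The abstract does not specify the bit width, the quantizer, or the exact calibration rule.

```python
import numpy as np

# Hypothetical layer-name suffixes for attention projections (an assumption).
ATTENTION_PROJECTIONS = ("q_proj", "k_proj", "v_proj", "o_proj")


def quantize_symmetric(w, bits=8):
    """Symmetric per-output-channel quantization (supports bits <= 8).

    Returns integer codes and one float scale per output row.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    # Use scale 1.0 for all-zero rows to avoid dividing by zero below.
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to floating point."""
    return q.astype(np.float32) * scale


def apply_selective_layout(weights, bits=8):
    """Quantize every linear weight except attention projections.

    `weights` maps layer names to 2-D arrays of shape (out, in).
    """
    layout = {}
    for name, w in weights.items():
        if name.endswith(ATTENTION_PROJECTIONS):
            layout[name] = ("float", w)  # kept in floating point
        else:
            layout[name] = ("int",) + quantize_symmetric(w, bits)
    return layout


def match_temperature(ref_logits, quant_logits, eps=1e-8):
    """Per-head scalar that rescales quantized-model attention logits so
    their standard deviation matches the full-precision reference.

    Both inputs have shape (heads, queries, keys); the result has shape
    (heads,). Matching the standard deviation is an illustrative choice.
    """
    ref_std = ref_logits.std(axis=(1, 2))
    quant_std = quant_logits.std(axis=(1, 2))
    return ref_std / (quant_std + eps)
```

In this sketch, the per-head factors would be estimated once on calibration data and then multiplied into the attention logits of the quantized model, so no retraining is involved.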
