QVLA: Not All Channels Are Equal in Vision-Language-Action Model's Quantization
Yuhao Xu, Yantai Yang, Zhenyang Fan, Yufan Liu, Yuming Li, Bing Li, Zhipeng Zhang

TL;DR
This paper introduces QVLA, a novel action-centric quantization framework for Vision-Language-Action models that significantly reduces model size and computational requirements while preserving high performance, enabling deployment on resource-limited robotic platforms.
Contribution
QVLA presents a channel-wise, importance-guided quantization method specifically designed for embodied control, improving over uniform-bit approaches by considering action-space sensitivity.
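To make "channel-wise" concrete, here is a minimal sketch of mixed-precision fake quantization in which every output channel of a weight matrix gets its own bit-width. It is not the authors' implementation: the function name `quantize_per_channel`, the symmetric max-abs scaling, and the convention that 0 bits means "prune the channel" are all assumptions made for illustration.

```python
import numpy as np

def quantize_per_channel(weight, bits):
    """Fake-quantize each output channel (row) of `weight` with its own bit-width.

    Assumed convention (illustrative only): bits[i] == 0 prunes channel i,
    bits[i] >= 16 leaves it untouched, anything else uses symmetric uniform
    quantization with a per-channel max-abs scale.
    """
    weight = np.asarray(weight, dtype=np.float64)
    out = np.empty_like(weight)
    for i, (row, b) in enumerate(zip(weight, bits)):
        if b == 0:
            out[i] = 0.0                      # pruned channel
        elif b >= 16:
            out[i] = row                      # kept at full precision
        elif b < 2:
            raise ValueError("bit-widths between 1 and 15 must be at least 2")
        else:
            qmax = 2 ** (b - 1) - 1           # largest signed integer level
            max_abs = np.abs(row).max()
            scale = max_abs / qmax if max_abs > 0 else 1.0
            q = np.clip(np.round(row / scale), -qmax, qmax)
            out[i] = q * scale                # dequantize back to float
    return out

w = np.array([[1.0, -0.5, 0.25],
              [2.0, 1.0, -1.0],
              [0.3, 0.2, 0.1]])
w_q = quantize_per_channel(w, bits=[8, 0, 16])
```

In this example the first channel is rounded to 8-bit levels, the second is zeroed out, and the third passes through unchanged, which is one way quantization and pruning can share a single per-channel bit-width vector.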
Findings
Uses only 29.2% of the original VRAM while retaining 98.9% of the original performance.
Outperforms LLM-based quantization methods by 22.6%.
Provides a unified framework for quantization and pruning in VLA models.
Abstract
The advent of Vision-Language-Action (VLA) models represents a significant leap for embodied intelligence, yet their immense computational demands critically hinder deployment on resource-constrained robotic platforms. Intuitively, low-bit quantization is a prevalent and preferred technique for large-scale model compression. However, we find that a systematic analysis of VLA model's quantization is fundamentally lacking. We argue that naively applying uniform-bit quantization from Large Language Models (LLMs) to robotics is flawed, as these methods prioritize passive data fidelity while ignoring how minor action deviations compound into catastrophic task failures. To bridge this gap, we introduce QVLA, the first action-centric quantization framework specifically designed for embodied control. In a sharp departure from the rigid, uniform-bit quantization of LLM-based methods, QVLA…
Peer Reviews
Decision·ICLR 2026 Poster
1. The algorithm defined in the paper is a novel way of quantizing Vision-Language-Action models.
2. The authors efficiently allocate the computation budget for identifying the sensitivity of different channels.
3. They show empirically that the algorithm achieves performance close to the full-precision model while reducing the VRAM required and delivering a speedup.
1. The evaluation focuses on one class of models and a single benchmark; it is not clear how the method generalizes across models and benchmarks.
2. There are no detailed ablation studies isolating the contribution of individual parts of the method, such as the calibration set size or the gate ratio selection.
3. There is no theoretical analysis of the design choices made in the paper.
1. The paper introduces a fine-grained approach to quantifying errors by isolating individual channels, allowing for a more precise evaluation of the error for each channel at different bit-widths, ensuring better control over the quantization process.
2. Unlike traditional global bit allocation, the paper proposes a per-channel adaptive bit allocation strategy and employs a greedy search algorithm to optimize bit allocation. The algorithm dynamically adjusts the bit-width for each channel based o
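The greedy per-channel bit allocation mentioned in this review can be sketched as follows. This is a hedged illustration, not the paper's algorithm: the sensitivity table, the candidate bit-widths, and the "smallest added error per bit saved" rule are assumptions, and `greedy_bit_allocation` is a hypothetical name.

```python
import numpy as np

def greedy_bit_allocation(sensitivity, bit_choices, target_avg_bits):
    """Pick one bit-width per channel under an average-bit budget.

    sensitivity[c, k] is an assumed proxy for the action error when channel c
    uses bit_choices[k]; bit_choices must be sorted from high to low.
    """
    sensitivity = np.asarray(sensitivity, dtype=np.float64)
    bits = np.array(bit_choices)
    n_channels = sensitivity.shape[0]
    level = np.zeros(n_channels, dtype=int)   # every channel starts at the highest width
    budget = target_avg_bits * n_channels

    while bits[level].sum() > budget:
        best_c, best_cost = -1, np.inf
        for c in range(n_channels):
            k = level[c]
            if k + 1 >= len(bits):
                continue                      # already at the lowest bit-width
            saved = bits[k] - bits[k + 1]
            # Added error per bit saved if this channel steps down one level.
            cost = (sensitivity[c, k + 1] - sensitivity[c, k]) / saved
            if cost < best_cost:
                best_c, best_cost = c, cost
        if best_c < 0:
            break                             # budget unreachable with these choices
        level[best_c] += 1
    return bits[level]

# Three channels; the second is far more sensitive than the others.
sens = [[0.0, 0.1, 0.5],
        [0.0, 1.0, 5.0],
        [0.0, 0.01, 0.02]]
alloc = greedy_bit_allocation(sens, bit_choices=[16, 8, 4], target_avg_bits=8)
```

With these made-up numbers the sensitive channel keeps 16 bits while the two tolerant channels drop to 4 bits, meeting the 8-bit average.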
1. The sensitivity evaluation and bit allocation are only conducted on a few VLA models in the paper. It remains unclear whether the same process needs to be applied to other VLA models, and if so, whether it will lead to significantly higher computational resource consumption. A more generalizable analysis across a broader range of VLA models such as UniVLA[1] would provide better insight into the scalability of the proposed method.
2. Inadequate comparison with relevant baselines. Since the VL
1. Conducted a detailed analysis of the sensitivity of parameters across different channels to action generation.
2. Significant memory reduction and speedup with minimal performance loss make large VLA deployment feasible.
All experiments are conducted on the LIBERO benchmarks; model quantization is not studied on real-world tasks.
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Advanced Memory and Neural Computing
