SPEED-Q: Staged Processing with Enhanced Distillation towards Efficient Low-bit On-device VLM Quantization
Tianyu Guo, Shanwei Zhao, Shiai Zhu, Chenguang Ma

TL;DR
SPEED-Q is a staged, distillation-based quantization framework for low-bit deployment of 1B to 2B parameter vision-language models on edge devices, improving both accuracy and stability.
Contribution
SPEED-Q is presented as the first framework to systematically address aggressive quantization of billion-parameter VLMs, combining staged sensitivity adaptation with distillation to improve stability and performance.
Findings
Achieves up to 6x higher accuracy than existing methods at 2-bit quantization.
Outperforms prior on-device VLMs at 2-bit and 4-bit settings.
Enables stable, data-efficient quantization of large VLMs.
Abstract
Deploying Vision-Language Models (VLMs) on edge devices (e.g., smartphones and robots) is crucial for enabling low-latency and privacy-preserving intelligent applications. Given the resource constraints of these devices, quantization offers a promising solution by improving memory efficiency and reducing bandwidth requirements, thereby facilitating the deployment of VLMs. However, existing research has rarely explored aggressive quantization of VLMs, particularly for models ranging from 1B to 2B parameters, which are better suited to resource-constrained edge devices. In this paper, we propose SPEED-Q, a novel Staged Processing with Enhanced Distillation framework for VLM low-bit weight-only quantization that systematically addresses the following two critical obstacles: (1) significant discrepancies in quantization sensitivity between vision (ViT) and language (LLM) components in…
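To make "low-bit weight-only quantization" concrete, here is a minimal sketch of generic group-wise round-to-nearest quantization in NumPy. It is not SPEED-Q's staged distillation procedure, which the truncated abstract does not spell out. The function names, the group size of 64, and the random test matrix are all assumptions chosen for illustration. The memory estimate counts weight codes only and ignores the per-group scale and offset overhead.

```python
import numpy as np

def quantize_weights(w, bits=2, group_size=64):
    """Asymmetric round-to-nearest quantization, applied per group of weights.

    Illustrative baseline only; group_size=64 is an assumed value.
    """
    levels = 2 ** bits - 1                      # e.g. 3 for 2-bit codes 0..3
    groups = w.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / levels
    codes = np.clip(np.round((groups - w_min) / scale), 0, levels)
    return codes.astype(np.uint8), scale, w_min

def dequantize_weights(codes, scale, w_min, shape):
    """Map integer codes back to approximate floating-point weights."""
    return (codes * scale + w_min).reshape(shape)

def weight_memory_gb(n_params, bits):
    """Storage for the weight codes alone, in gigabytes (10**9 bytes)."""
    return n_params * bits / 8 / 1e9

# Hypothetical weight matrix standing in for one linear layer.
rng = np.random.default_rng(0)
w = rng.normal(size=(128, 256)).astype(np.float32)

errors = {}
for bits in (4, 2):
    codes, scale, w_min = quantize_weights(w, bits=bits)
    w_hat = dequantize_weights(codes, scale, w_min, w.shape)
    errors[bits] = float(np.mean((w - w_hat) ** 2))
    print(f"{bits}-bit: mse={errors[bits]:.5f}, "
          f"2B-param weights ~ {weight_memory_gb(2e9, bits)} GB")
```

Under these assumptions, the weights of a 2B-parameter model shrink from about 4 GB at 16 bits to about 1 GB at 4 bits and 0.5 GB at 2 bits. The reconstruction error grows as the bit-width drops, and that accuracy loss is what methods for aggressive quantization aim to recover.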
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Data Compression Techniques
