TL;DR
This paper introduces SplitQ, a post-training quantization framework that addresses cross-modal heterogeneity in low-bit quantization of vision-language models (VLMs) and reports higher accuracy than existing methods on 6 multi-modal datasets.
Contribution
SplitQ combines Modality-specific Outlier Channel Decoupling (MOCD) with adaptive cross-modal calibration to improve low-bit quantization of VLMs, outperforming existing methods.
Findings
SplitQ preserves 93.5% of FP16 performance under the W3A3 setting.
It outperforms existing approaches across 6 multi-modal datasets.
SplitQ is effective under various quantization settings including W4A8, W4A4, W3A3, and W3A2.
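For readers unfamiliar with the labels above: by common convention, WxAy means x-bit weights and y-bit activations. The sketch below is only an illustration of that notation, assuming plain symmetric per-tensor uniform quantization on random data; it is not SplitQ, and `fake_quantize` and `parse_setting` are hypothetical helper names.

```python
import re

import numpy as np


def fake_quantize(x, bits):
    """Symmetric per-tensor uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 3 for 3 bits, 1 for 2 bits
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return x.copy()
    scale = max_abs / qmax                # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale


def parse_setting(setting):
    """Split a label such as 'W4A8' into (weight_bits, activation_bits)."""
    match = re.fullmatch(r"W(\d+)A(\d+)", setting)
    if match is None:
        raise ValueError(f"unrecognized setting: {setting!r}")
    return int(match.group(1)), int(match.group(2))


# Synthetic data (an assumption for illustration, not from the paper).
rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64))
acts = rng.normal(size=(16, 64))
reference = acts @ weights.T

# Relative error of a quantized linear layer under each setting.
errors = {}
for setting in ["W4A8", "W4A4", "W3A3", "W3A2"]:
    w_bits, a_bits = parse_setting(setting)
    approx = fake_quantize(acts, a_bits) @ fake_quantize(weights, w_bits).T
    errors[setting] = float(
        np.linalg.norm(approx - reference) / np.linalg.norm(reference)
    )
```

The `errors` dictionary gives a rough sense of how quickly naive quantization degrades as bit widths shrink, which is the regime the settings above target.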
Abstract
Low-bit post-training quantization (PTQ) is a pivotal technique for deploying Vision-Language Models (VLMs) on resource-constrained devices. However, existing PTQ methods often degrade VLMs' accuracy due to the heterogeneous activation distributions of text and vision modalities during quantization. We find that this cross-modal heterogeneity is distributed unevenly across channels: a small subset of channels contains most modality-specific outliers, and these outliers typically reside in different channels for each modality. Motivated by this, we propose SplitQ, a channel-Splitting-driven post-training Quantization framework. At its core, SplitQ introduces a novel Modality-specific Outlier Channel Decoupling (MOCD) module that effectively isolates salient modality-specific outlier channels with minimal overhead. To further address the remaining cross-modal distribution discrepancies,…
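The observation that modality-specific outliers concentrate in a few channels, and in different channels for each modality, can be illustrated with a toy example. This is a minimal sketch on synthetic activations, assuming a simple peak-magnitude threshold and separate quantization scales for the split channels; it is not the MOCD module, and all function names, channel indices, and the `ratio` threshold are hypothetical.

```python
import numpy as np


def outlier_channels(acts, ratio=8.0):
    """Channels whose peak magnitude exceeds ratio x the median channel peak."""
    peaks = np.max(np.abs(acts), axis=0)
    return np.flatnonzero(peaks > ratio * np.median(peaks))


def quantize(x, bits):
    """Symmetric uniform quantization with a single scale, then dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    if scale == 0:
        return x.copy()
    return np.clip(np.round(x / scale), -qmax, qmax) * scale


def quantize_with_split(acts, split_idx, bits):
    """Quantize the split channels and the remaining channels with separate scales."""
    mask = np.zeros(acts.shape[1], dtype=bool)
    mask[split_idx] = True
    out = np.empty_like(acts)
    if mask.any():
        out[:, mask] = quantize(acts[:, mask], bits)
    if (~mask).any():
        out[:, ~mask] = quantize(acts[:, ~mask], bits)
    return out


# Synthetic (tokens x channels) activations; outlier positions are made up.
rng = np.random.default_rng(0)
vision = rng.normal(size=(256, 64))
text = rng.normal(size=(256, 64))
vision[:, [3, 17]] *= 40.0   # vision-only outlier channels
text[:, [5, 40]] *= 40.0     # text-only outlier channels

vision_out = outlier_channels(vision)
text_out = outlier_channels(text)

# Quantize the mixed token sequence with and without splitting the outliers.
mixed = np.concatenate([vision, text], axis=0)
split_idx = np.union1d(vision_out, text_out)
err_shared = float(np.mean((quantize(mixed, 4) - mixed) ** 2))
err_split = float(np.mean((quantize_with_split(mixed, split_idx, 4) - mixed) ** 2))
```

In this toy setup the two modalities flag disjoint channel sets, and giving those channels their own scale keeps the remaining channels from being rounded to zero by a scale dominated by outliers, so `err_split` comes out lower than `err_shared`.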
