Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models

Yi Zhong; Haotong Qin; Xindong Zhang; Lei Zhang; Guolei Sun

arXiv:2605.19929 · cs.CV · May 20, 2026

TL;DR

This paper introduces SplitQ, a post-training quantization framework that addresses cross-modal heterogeneity in low-bit quantization of vision-language models. It retains 93.5% of FP16 performance under the W3A3 setting and outperforms existing approaches across 6 multi-modal datasets.

Contribution

SplitQ employs modality-specific outlier channel decoupling and adaptive cross-modal calibration to enhance low-bit quantization of VLMs, outperforming existing methods.

Findings

01

SplitQ preserves 93.5% of FP16 performance under the W3A3 setting.

02

It outperforms existing approaches across 6 multi-modal datasets.

03

SplitQ is effective under various quantization settings including W4A8, W4A4, W3A3, and W3A2.
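
For readers unfamiliar with the labels, a setting written WxAy conventionally means x-bit weights and y-bit activations. The sketch below is only an illustration of that notation using plain symmetric uniform quantization; `fake_quantize` and `parse_setting` are hypothetical helper names, the data is random, and none of this is SplitQ's actual quantizer.

```python
import numpy as np

def fake_quantize(x, bits):
    """Symmetric per-tensor uniform quantization followed by dequantization.

    Illustrative only; assumes bits >= 2.
    """
    qmax = 2 ** (bits - 1) - 1              # largest positive integer level
    max_abs = float(np.max(np.abs(x)))
    if max_abs == 0.0:
        return x.copy()
    scale = max_abs / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def parse_setting(name):
    """Split a label such as 'W4A8' into (weight_bits, activation_bits)."""
    w_bits, a_bits = name.upper().lstrip("W").split("A")
    return int(w_bits), int(a_bits)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.normal(size=(64, 64))      # synthetic weight matrix
    acts = rng.normal(size=(16, 64))         # synthetic activations
    for setting in ["W4A8", "W4A4", "W3A3", "W3A2"]:
        w_bits, a_bits = parse_setting(setting)
        w_err = np.mean((weights - fake_quantize(weights, w_bits)) ** 2)
        a_err = np.mean((acts - fake_quantize(acts, a_bits)) ** 2)
        print(f"{setting}: weight MSE={w_err:.4f}, activation MSE={a_err:.4f}")
```

Lower bit widths leave fewer representable levels, which is why settings such as W3A3 and W3A2 are the harder cases.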

Abstract

Low-bit post-training quantization (PTQ) is a pivotal technique for deploying Vision-Language Models (VLMs) on resource-constrained devices. However, existing PTQ methods often degrade VLMs' accuracy due to the heterogeneous activation distributions of text and vision modalities during quantization. We find that this cross-modal heterogeneity is distributed unevenly across channels: a small subset of channels contains most modality-specific outliers, and these outliers typically reside in different channels for each modality. Motivated by this, we propose SplitQ, a channel-Splitting-driven post-training Quantization framework. At its core, SplitQ introduces a novel Modality-specific Outlier Channel Decoupling (MOCD) module that effectively isolates salient modality-specific outlier channels with minimal overhead. To further address the remaining cross-modal distribution discrepancies,…
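
The observation that outliers concentrate in a few channels, and in different channels for each modality, can be illustrated with a toy experiment. The sketch below is a generic outlier-channel isolation heuristic on synthetic data, not the paper's MOCD module: the function names, the top-k selection rule, the channel indices, and the 30x outlier magnitude are all assumptions made for illustration.

```python
import numpy as np

def fake_quantize(x, bits):
    """Symmetric per-tensor uniform quantize-dequantize (illustrative)."""
    if x.size == 0:
        return x.copy()
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    if max_abs == 0.0:
        return x.copy()
    scale = max_abs / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def top_outlier_channels(acts, k):
    """Indices of the k channels with the largest peak magnitude."""
    peak = np.max(np.abs(acts), axis=0)
    return set(np.argsort(peak)[-k:].tolist())

def quantize_with_split(acts, outliers, bits):
    """Quantize outlier channels and remaining channels as two separate groups."""
    mask = np.zeros(acts.shape[1], dtype=bool)
    mask[sorted(outliers)] = True
    out = np.empty_like(acts)
    out[:, mask] = fake_quantize(acts[:, mask], bits)
    out[:, ~mask] = fake_quantize(acts[:, ~mask], bits)
    return out

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Synthetic activations: rows are tokens, columns are channels.
rng = np.random.default_rng(0)
text = rng.normal(size=(128, 64))
vision = rng.normal(size=(512, 64))
text[:, [3, 17]] *= 30.0       # assumed text-only outlier channels
vision[:, [8, 42]] *= 30.0     # assumed vision-only outlier channels

text_out = top_outlier_channels(text, k=2)
vision_out = top_outlier_channels(vision, k=2)

# One shared scale per tensor versus separate scales for outlier channels.
joint_err = mse(vision, fake_quantize(vision, bits=4))
split_err = mse(vision, quantize_with_split(vision, vision_out, bits=4))

if __name__ == "__main__":
    print("text outlier channels:", sorted(text_out))
    print("vision outlier channels:", sorted(vision_out))
    print(f"vision MSE, shared scale: {joint_err:.4f}")
    print(f"vision MSE, split scales: {split_err:.4f}")
```

In this toy setup a single shared scale is dominated by the few outlier channels, so the ordinary channels lose almost all resolution; giving the outlier channels their own scale reduces the error.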

Code & Models

Repositories

EMVision-NK/SplitQ (GitHub)
