CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers
Natalia Frumkin, Dibakar Gope, and Diana Marculescu

TL;DR
CPT-V introduces a contrastive, self-supervised method to enhance post-training quantization accuracy of vision transformers by perturbing scales and using a block-wise evolutionary search with minimal calibration data.
Contribution
The paper proposes a contrastive loss-based approach, combined with a block-wise evolutionary search over quantization scales, to improve the accuracy of already-quantized vision transformers without retraining.
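The search over perturbed scales can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it assumes a single scalar scale for one block of values, a simple (1+λ) evolution strategy, and a pluggable `fitness` function. The names `quantize`, `mse`, and `evolve_scale` are hypothetical, and mean squared error stands in for the paper's contrastive loss purely to keep the example short.

```python
import random


def quantize(values, scale, bits):
    """Uniform symmetric fake-quantization: snap to the grid, clamp, dequantize."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]


def mse(reference, approx):
    """Stand-in fitness (assumption): mean squared error, lower is better."""
    return sum((r - a) ** 2 for r, a in zip(reference, approx)) / len(reference)


def evolve_scale(values, init_scale, bits, fitness,
                 population=16, generations=20, sigma=0.05, seed=0):
    """Perturb the scale at random and keep a candidate only if it lowers the loss."""
    rng = random.Random(seed)
    best_scale = init_scale
    best_loss = fitness(values, quantize(values, init_scale, bits))
    for _ in range(generations):
        # Mutation: multiplicative Gaussian perturbations around the current best.
        candidates = [best_scale * (1.0 + rng.gauss(0.0, sigma))
                      for _ in range(population)]
        for scale in candidates:
            if scale <= 0:
                continue  # a non-positive scale is not a valid quantizer
            loss = fitness(values, quantize(values, scale, bits))
            if loss < best_loss:
                best_scale, best_loss = scale, loss
    return best_scale, best_loss


if __name__ == "__main__":
    weights = [0.11, -0.42, 0.73, -0.95, 0.28, 0.60, -0.17, 0.05]
    start = 0.5  # deliberately poor initial scale for 3-bit quantization
    print("initial loss:", mse(weights, quantize(weights, start, 3)))
    print("after search:", evolve_scale(weights, start, 3, mse))
```

Because a candidate is accepted only when it lowers the loss, the search can never end worse than the initial quantized model, which fits the idea of refining a network that has already been quantized. A block-wise variant would repeat this loop for one transformer block at a time.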
Findings
Improves ViT-Base top-1 accuracy by up to 10.30% at 3-bit quantization.
Effective across various ViT architectures and quantization levels.
Requires only 1,000 calibration images for optimization.
Abstract
When considering post-training quantization, prior work has typically focused on developing a mixed precision scheme or learning the best way to partition a network for quantization. In our work, CPT-V, we look at a general way to improve the accuracy of networks that have already been quantized, simply by perturbing the quantization scales. Borrowing the idea of contrastive loss from self-supervised learning, we find a robust way to jointly minimize a loss function using just 1,000 calibration images. In order to determine the best performing quantization scale, CPT-V contrasts the features of quantized and full precision models in a self-supervised fashion. Unlike traditional reconstruction-based loss functions, the use of a contrastive loss function not only rewards similarity between the quantized and full precision outputs but also helps in distinguishing the quantized output…
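The contrast between quantized and full precision features can be sketched as an InfoNCE-style loss. This is an illustrative assumption rather than the paper's exact formulation: the abstract is truncated here and does not give the loss, so the choice of negatives (full precision features of the other images in the batch), the cosine similarity, and the `temperature` value are all assumed, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def contrastive_loss(quant_feats, fp_feats, temperature=0.1):
    """InfoNCE-style loss over a batch of feature vectors.

    quant_feats[i] and fp_feats[i] come from the same image (positive pair);
    fp_feats[j] with j != i serve as negatives (assumption).
    """
    total = 0.0
    for i, q in enumerate(quant_feats):
        logits = [cosine(q, f) / temperature for f in fp_feats]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(quant_feats)


if __name__ == "__main__":
    full_precision = [[1.0, 0.0], [0.0, 1.0]]
    close_to_fp = [[0.9, 0.1], [0.1, 0.9]]   # quantized features near their FP match
    mismatched = [[0.1, 0.9], [0.9, 0.1]]    # quantized features near the wrong image
    print(contrastive_loss(close_to_fp, full_precision))
    print(contrastive_loss(mismatched, full_precision))
```

Unlike a pure reconstruction loss, this objective falls when a quantized feature is close to its own full precision counterpart and also when it is far from the features of other inputs, so the well-matched batch above scores lower than the mismatched one.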
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Image Enhancement Techniques · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Residual Connection · Vision Transformer
