Q-SAM2: Accurate Quantization for Segment Anything Model 2
Nicola Farronato, Florian Scheidegger, Mattia Rigotti, Cristiano Malossi, Michele Magno, Haotong Qin

TL;DR
Q-SAM2 is a low-bit quantization method for the Segment Anything Model 2 (SAM2). It substantially reduces model size and computational cost while preserving segmentation accuracy, using a calibration-based initialization and learned clipping of outliers.
Contribution
The paper proposes Variance-Reduced Calibration (VRC) and Learnable Statistical Clipping (LSC), two methods that improve low-bit quantization performance for SAM2.
Findings
Improves video segmentation accuracy by up to 9.7 percentage points.
Reduces model size by 8× relative to the BF16 baseline.
Outperforms state-of-the-art quantization schemes in the ultra-low 2-bit regime.
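The 8× figure is consistent with simple bit-width arithmetic: BF16 stores 16 bits per weight, and the 2-bit regime stores 2. The sketch below shows that calculation. The function name `compression_ratio` and the `overhead_bits` parameter are illustrative assumptions, not taken from the paper, and the idealized result ignores quantization parameters (scales, zero points) and any layers kept at higher precision.

```python
def compression_ratio(baseline_bits, quantized_bits, overhead_bits=0.0):
    """Idealized size ratio between a baseline and a quantized model.

    overhead_bits is a hypothetical per-weight cost for storing
    quantization parameters; it defaults to zero (the ideal case).
    """
    return baseline_bits / (quantized_bits + overhead_bits)


# BF16 baseline (16 bits per weight) versus 2-bit weights, no overhead.
ratio = compression_ratio(16, 2)

# With a small assumed overhead the achievable ratio drops below the ideal.
ratio_with_overhead = compression_ratio(16, 2, overhead_bits=0.25)
```

Any real overhead pushes the ratio below the ideal value, so 8× is best read as the upper bound for a uniformly 2-bit model.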
Abstract
The Segment Anything Model 2 (SAM2) is a powerful foundation model for promptable segmentation. However, its high computational and memory costs are a major barrier to deployment on resource-constrained devices. In this paper, we present Q-SAM2, an accurate low-bit quantization method that achieves high compression and high fidelity. To address performance degradation arising from challenging weight and activation distributions during quantization, Q-SAM2 introduces two novel contributions: Variance-Reduced Calibration (VRC), an initialization method that reduces weight statistical variance by minimizing the Frobenius norm over a small calibration batch; and Learnable Statistical Clipping (LSC), a Quantization-Aware Training (QAT) method that learns momentum-stabilized clipping factors to manage outliers in weights and activations. Comprehensive experiments demonstrate that Q-SAM2…
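To make the idea of statistics-based clipping concrete, here is a minimal sketch of uniform fake-quantization with a clipping threshold derived from a running standard deviation. It is not the paper's LSC implementation: the abstract gives no formula, so the rule `clip = factor * running_std`, the names `StatisticalClipper` and `fake_quantize`, and the default values are all assumptions made for illustration. In actual QAT the factor would be a trainable parameter updated by gradient descent; here it is a fixed number.

```python
from statistics import pstdev


def fake_quantize(values, clip, bits=2):
    """Clamp values to [-clip, clip] and snap them to 2**bits uniform levels."""
    steps = 2 ** bits - 1            # intervals between the lowest and highest level
    step = 2 * clip / steps
    out = []
    for v in values:
        clamped = max(-clip, min(clip, v))
        index = round((clamped + clip) / step)   # integer level in 0..steps
        out.append(index * step - clip)          # back to the real-valued grid
    return out


class StatisticalClipper:
    """Hypothetical clipper: threshold = factor * running standard deviation."""

    def __init__(self, factor=2.0, momentum=0.9):
        self.factor = factor         # assumed to be learned during QAT; fixed here
        self.momentum = momentum     # smoothing for the running statistic
        self.running_std = None

    def update(self, values):
        """Fold a new batch into the running std and return the clip threshold."""
        std = pstdev(values)
        if self.running_std is None:
            self.running_std = std
        else:
            m = self.momentum
            self.running_std = m * self.running_std + (1 - m) * std
        return self.factor * self.running_std


# Toy weights with one large outlier.
weights = [0.1, -0.2, 0.05, 0.3, -0.15, 8.0]
clipper = StatisticalClipper(factor=2.0, momentum=0.9)
clip = clipper.update(weights)
quantized = fake_quantize(weights, clip, bits=2)
```

In this toy example the threshold falls below the outlier's magnitude, so the outlier is clamped rather than stretching the 2-bit grid over the whole range. The momentum term keeps the threshold from jumping when a single batch has unusual statistics.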
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Image and Video Quality Assessment
Methods: Softmax · Attention Is All You Need · Linear Layer
