TL;DR
GRACE is a framework that unifies knowledge distillation and quantization-aware training (QAT) under the Information Bottleneck principle, producing efficient vision-language models with minimal accuracy loss.
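To make the quantization side concrete, the following is a minimal sketch of symmetric per-tensor INT4 "fake quantization", the round-then-dequantize step that QAT setups commonly apply to weights in the forward pass. It is an illustration under assumptions, not GRACE's quantizer: the function name `fake_quantize_int4`, the per-tensor max-abs scale, and the signed range [-8, 7] are all choices made here for the example.

```python
def fake_quantize_int4(weights):
    """Snap floats to a symmetric signed 4-bit grid, then map back to floats.

    Hypothetical helper for illustration; the summarized paper's actual
    quantizer (granularity, scale selection, rounding) is not given here.
    """
    qmin, qmax = -8, 7                      # signed 4-bit integer range
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights), 1.0           # all-zero tensor: nothing to scale
    scale = max_abs / qmax                  # per-tensor scale (an assumption)
    dequantized = []
    for w in weights:
        q = round(w / scale)                # nearest integer level
        q = max(qmin, min(qmax, q))         # clamp to the 4-bit range
        dequantized.append(q * scale)       # back to float for the forward pass
    return dequantized, scale


values, scale = fake_quantize_int4([0.7, -0.33, 0.12, 0.0])
print(values, scale)
```

During training, the rounding step is usually paired with a straight-through estimator so that gradients can pass through it; that part is omitted from this sketch.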
Contribution
GRACE introduces confidence-gated decoupled distillation, relational centered kernel alignment, and an adaptive Lagrangian controller to improve the quantization of vision-language models (VLMs), and is reported to outperform existing methods.
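The idea of gating distillation by teacher confidence can be sketched as follows. This is an assumption-laden illustration rather than the paper's loss: here a token is kept only when the teacher's top probability reaches a fixed `threshold`, and the kept tokens contribute a plain KL(teacher || student) term. The gating rule, the threshold value, and all function names are hypothetical, and the "decoupled" part of the method is not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def gated_distillation_loss(teacher_logits, student_logits, threshold=0.5):
    """Average KL over tokens where the teacher is confident.

    Tokens whose top teacher probability is below `threshold` are skipped,
    treating them as unreliable supervision (an assumed gating rule).
    Returns (mean loss over kept tokens, number of kept tokens).
    """
    total, kept = 0.0, 0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)
        if max(p) < threshold:
            continue                         # low-confidence token: no signal
        total += kl_divergence(p, softmax(s_row))
        kept += 1
    return (total / kept if kept else 0.0), kept


teacher = [[4.0, 0.0, 0.0], [0.1, 0.0, 0.05]]   # confident token, uncertain token
student = [[0.0, 0.0, 4.0], [3.0, 0.0, 0.0]]
print(gated_distillation_loss(teacher, student))
```

In the usage example, only the first token passes the gate, so the student's disagreement on the second, low-confidence token does not affect the loss.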
Findings
INT4 models outperform FP16 baselines across benchmarks on the LLaVA and Qwen families.
INT4 models nearly match teacher performance with significant resource savings.
INT4 models achieve 3x throughput and a 54% memory reduction.
Abstract
Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on LLaVA and Qwen families, our INT4 models consistently outperform FP16…
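An adaptive controller based on Lagrangian relaxation is commonly implemented as projected dual ascent: a multiplier grows while a constraint is violated and shrinks toward zero once it is satisfied. The sketch below shows that generic pattern only. Which quantity GRACE constrains, how its budget is set, and how the multiplier enters the loss are not stated in this summary, so `constraint_value`, `budget`, `step_size`, and both function names are hypothetical.

```python
def update_multiplier(lam, constraint_value, budget, step_size=0.1):
    """One projected dual-ascent step on a Lagrange multiplier.

    lam increases when constraint_value exceeds budget and decreases
    otherwise; it is clipped at zero so it remains a valid multiplier.
    """
    return max(0.0, lam + step_size * (constraint_value - budget))


def relaxed_objective(fidelity_loss, constraint_value, budget, lam):
    """Lagrangian relaxation: primary loss plus a weighted constraint violation."""
    return fidelity_loss + lam * (constraint_value - budget)


# Toy trace: the constraint starts above budget and then falls below it.
lam = 0.0
for constraint_value in [1.5, 1.4, 1.1, 0.9, 0.6]:
    lam = update_multiplier(lam, constraint_value, budget=1.0)
    print(round(lam, 4), round(relaxed_objective(0.3, constraint_value, 1.0, lam), 4))
```

The design choice worth noting is that the trade-off weight is not a hand-tuned constant: it is driven by how far the constrained quantity sits from its budget at each step.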
