HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation
Yihao Liang, Niraj K. Jha

TL;DR
HEED is a density-weighted residual alignment method for distilling vision-language models into faster hybrid architectures. It raises OCRBench v2 by 8.7 points and a 10-benchmark average by 5.13 points, at 4.12× throughput and 68% memory savings.
Contribution
The paper proposes HEED, a density-weighted residual alignment technique that focuses distillation on high-density patches (text, edges, object boundaries), targeting the OCR and document accuracy that the distilled student otherwise loses.
Findings
HEED improves OCRBench v2 scores by 8.7 points.
HEED achieves a 5.13-point increase on a 10-benchmark average.
The method maintains teacher-level performance with 4.12× throughput and 68% memory savings.
Abstract
Distilling vision-language models into faster hybrid architectures, such as 3:1 Mamba-2/attention mixes, is now standard practice for making inference efficient. Aggregate benchmarks suggest that this works, but they hide selective failures. When we distill Qwen3-VL-8B-Instruct into a 3:1 Mamba-2/attention hybrid, the student model stays within 2 points of the teacher on visual reasoning benchmarks such as MMStar, MMBench, and MMMU-Pro, but drops 13 points on optical-character-recognition and document tasks. The student can still understand the scene but loses the fine-grained text needed to answer. We localize much of the failure to a specific kind of position. In a high-resolution image, most patches are sky, wall, or smooth texture, while a small fraction carries text, edges, object boundaries, or other local details. In a token-level diagnostic, the top 10% highest-density patches…
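To make the idea of "high-density patches" concrete, here is a minimal Python sketch. The excerpt above does not define how HEED measures density or how its loss is formed, so everything in it is an assumption for illustration: `patch_density` uses mean gradient magnitude as a stand-in density score, `top_fraction_mask` picks the top 10% of patches by that score, and `density_weighted_residual_loss` is a plain density-weighted squared error between hypothetical student and teacher patch features. It should not be read as the authors' method.

```python
import numpy as np

def patch_density(image, patch=16):
    """Assumed density proxy (not from the paper): mean gradient magnitude per patch."""
    gy, gx = np.gradient(image.astype(np.float64))
    mag = np.hypot(gx, gy)
    gh, gw = image.shape[0] // patch, image.shape[1] // patch
    mag = mag[:gh * patch, :gw * patch]
    # Group pixels into (gh, gw) patches and average within each patch.
    return mag.reshape(gh, patch, gw, patch).mean(axis=(1, 3)).ravel()

def top_fraction_mask(density, fraction=0.10):
    """Boolean mask marking the highest-density `fraction` of patches."""
    k = max(1, int(round(fraction * density.size)))
    mask = np.zeros(density.size, dtype=bool)
    mask[np.argsort(density)[-k:]] = True
    return mask

def density_weighted_residual_loss(student, teacher, density, eps=1e-8):
    """Assumed loss form: per-patch squared error, weighted by normalized density."""
    weights = density / (density.sum() + eps)
    per_patch = ((student - teacher) ** 2).mean(axis=-1)
    return float((weights * per_patch).sum())

# Toy image: flat background with one textured 16x16 region in the top-left corner.
rng = np.random.default_rng(0)
image = np.zeros((64, 64))
image[:16, :16] = rng.random((16, 16))

density = patch_density(image)      # one score per patch, 4x4 grid flattened
mask = top_fraction_mask(density)   # top 10% of 16 patches
print(np.flatnonzero(mask))
```

Under this weighting, a feature mismatch on the textured patch contributes to the loss, while the same mismatch on a flat background patch contributes almost nothing, which is the qualitative behavior the abstract motivates.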
