HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation

Yihao Liang; Niraj K. Jha

arXiv:2605.17093 · cs.CV · May 19, 2026

TL;DR

HEED is a density-weighted residual alignment method for distilling vision-language models into hybrid architectures. It raises OCRBench v2 by 8.7 points and a 10-benchmark average by 5.13 points, with 4.12× throughput and 68% memory savings.

Contribution

The paper proposes HEED, a density-weighted residual alignment technique that focuses distillation on high-information patches, such as those carrying text, edges, or object boundaries, rather than on the low-detail background that fills most of a high-resolution image.

Findings

01. HEED improves OCRBench v2 scores by 8.7 points.

02. HEED achieves a 5.13-point increase on a 10-benchmark average.

03. The method maintains teacher-level performance with 4.12× throughput and 68% memory savings.

Abstract

Distilling vision-language models into faster hybrid architectures, such as 3:1 Mamba-2/attention mixes, is now standard practice for making inference efficient. Aggregate benchmarks suggest that this works, but they hide selective failures. When we distill Qwen3-VL-8B-Instruct into a 3:1 Mamba-2/attention hybrid, the student model stays within 2 points of the teacher on visual reasoning benchmarks such as MMStar, MMBench, and MMMU-Pro, but drops 13 points on optical-character-recognition and document tasks. The student can still understand the scene but loses the fine-grained text needed to answer. We localize much of the failure to a specific kind of position. In a high-resolution image, most patches are sky, wall, or smooth texture, while a small fraction carries text, edges, object boundaries, or other local details. In a token-level diagnostic, the top 10% highest-density patches…
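The excerpt above does not define "density" or the exact alignment loss, so the following is only a minimal sketch of the general idea, under stated assumptions. It uses per-patch pixel variance as a hypothetical density proxy, selects the top 10% of patches by that score, and weights a plain teacher-student feature mean squared error by normalized density. All function names, the variance proxy, and the weighting scheme are illustrative assumptions, and the plain feature MSE is a stand-in for, not a reproduction of, the paper's residual alignment.

```python
from statistics import pvariance


def patch_density(patch):
    """Hypothetical density proxy (assumption): variance of a patch's pixel values."""
    return pvariance(patch)


def top_fraction_indices(scores, fraction=0.10):
    """Return sorted indices of the top `fraction` of patches by score (at least one)."""
    k = max(1, round(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def density_weights(scores, eps=1e-8):
    """Normalize scores into weights that sum to 1 (assumed weighting scheme)."""
    total = sum(scores) + eps * len(scores)
    return [(s + eps) / total for s in scores]


def weighted_alignment_loss(teacher, student, weights):
    """Weighted sum of per-patch mean squared errors between teacher and student features."""
    loss = 0.0
    for t, s, w in zip(teacher, student, weights):
        mse = sum((a - b) ** 2 for a, b in zip(t, s)) / len(t)
        loss += w * mse
    return loss


# Toy image: nine flat patches (sky, wall) and one high-contrast, text-like patch.
patches = [[0.5, 0.5, 0.5, 0.5] for _ in range(9)] + [[0.0, 1.0, 0.0, 1.0]]
scores = [patch_density(p) for p in patches]
dense_idx = top_fraction_indices(scores, fraction=0.10)

# Toy features: the student matches the teacher everywhere except the dense patch.
teacher = [[1.0, 1.0] for _ in range(10)]
student = [[1.0, 1.0] for _ in range(9)] + [[0.0, 0.0]]

uniform = weighted_alignment_loss(teacher, student, [1.0 / 10] * 10)
weighted = weighted_alignment_loss(teacher, student, density_weights(scores))
print(dense_idx, uniform, weighted)
```

In this toy setup the student's only error sits on the single high-density patch. A uniform average dilutes that error across the nine background patches, whereas the density-weighted loss keeps it near full strength, which is the intuition behind weighting alignment toward text-bearing regions.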
