Towards Long-window Anchoring in Vision-Language Model Distillation
Haoyi Zhou, Shuo Li, Tianyu Chen, Qi Song, Chonghan Gao, Jianxin Li

TL;DR
This paper introduces LAid, a distillation method that transfers long-range attention behavior from large vision-language models to small ones, giving the distilled models effective context windows up to 3.2 times longer.
Contribution
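LAid targets the transfer of long-range attention mechanisms to small models through two complementary components: a progressive distance-weighted attention matching that emphasizes longer position differences during training, and a learnable RoPE response gain modulation that amplifies position sensitivity where needed.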
Findings
LAid models achieve up to 3.2x longer effective context windows.
LAid preserves low-frequency attention components better than existing methods.
LAid improves performance on standard VL benchmarks with extended context.
Abstract
While large vision-language models (VLMs) demonstrate strong long-context understanding, their widely used smaller variants struggle with language-image alignment when the window size is limited. We discover that knowledge distillation improves students' capability as a complement to Rotary Position Embeddings (RoPE) on window sizes (anchored from large models). Building on this insight, we propose LAid, which directly aims at the transfer of long-range attention mechanisms through two complementary components: (1) a progressive distance-weighted attention matching that dynamically emphasizes longer position differences during training, and (2) a learnable RoPE response gain modulation that selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve up to 3.2 times longer effective context windows…
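
To make component (1) concrete, the following is a minimal sketch of what a distance-weighted attention matching loss could look like. It is not the authors' implementation: the abstract does not give the loss formula, so the weighting function, the `progress` schedule variable, the `max_power` parameter, and both function names are assumptions chosen only for illustration. The sketch compares a student attention map with a teacher attention map and weights each squared error by the normalized distance between query and key positions, with the emphasis on distant pairs growing as training progresses.

```python
import numpy as np

def distance_weights(seq_len, progress, max_power=2.0):
    """Hypothetical weights w[i, j] that grow with |i - j|.

    progress is in [0, 1]; at 0 every pair is weighted equally, and
    larger values put more emphasis on distant positions (assumption).
    """
    pos = np.arange(seq_len)
    # Normalized position difference in [0, 1].
    dist = np.abs(pos[:, None] - pos[None, :]) / max(seq_len - 1, 1)
    w = (1.0 + dist) ** (progress * max_power)
    # Rescale so the average weight stays at 1 across the schedule.
    return w / w.mean()

def attention_matching_loss(student_attn, teacher_attn, progress):
    """Distance-weighted mean squared error between two attention maps.

    Both inputs are (seq_len, seq_len) arrays of attention probabilities.
    """
    w = distance_weights(student_attn.shape[-1], progress)
    return float(np.mean(w * (student_attn - teacher_attn) ** 2))

# Toy usage with random row-normalized attention maps.
rng = np.random.default_rng(0)
teacher = rng.random((8, 8))
teacher /= teacher.sum(axis=-1, keepdims=True)
student = rng.random((8, 8))
student /= student.sum(axis=-1, keepdims=True)

for progress in (0.0, 0.5, 1.0):
    print(progress, attention_matching_loss(student, teacher, progress))
```

With `progress = 0` the loss reduces to a plain mean squared error, so under these assumptions the schedule moves smoothly from uniform matching toward matching that favors long-range position pairs.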
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
