Towards Long-window Anchoring in Vision-Language Model Distillation

Haoyi Zhou; Shuo Li; Tianyu Chen; Qi Song; Chonghan Gao; Jianxin Li

arXiv:2512.21576 · cs.CV · January 6, 2026

TL;DR

This paper introduces LAid, a novel distillation method that enhances small vision-language models with long-range attention capabilities, enabling them to process much longer contexts without sacrificing performance.

Contribution

LAid is a new distillation approach that effectively transfers long-range attention mechanisms to small models, significantly extending their context window while maintaining accuracy.

Findings

01 LAid models achieve up to 3.2x longer effective context windows.

02 LAid preserves low-frequency attention components better than existing methods.

03 LAid improves performance on standard VL benchmarks with extended context.

Abstract

While large vision-language models (VLMs) demonstrate strong long-context understanding, their widely used smaller variants fail at language-image alignment because of their limited window size. We discover that knowledge distillation, as a complement to Rotary Position Embeddings (RoPE), improves students' capability on window sizes, which are anchored from large models. Building on this insight, we propose LAid, which directly targets the transfer of long-range attention mechanisms through two complementary components: (1) a progressive distance-weighted attention matching that dynamically emphasizes longer position differences during training, and (2) a learnable RoPE response gain modulation that selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve up to 3.2 times longer effective context windows…
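To make component (1) more concrete, here is a minimal sketch of what a progressive distance-weighted attention matching loss could look like. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the weighting function, the schedule, or the error measure, so the linear distance weighting, the `progress` parameter in [0, 1], the `max_boost` factor, the squared-error term, and both function names are hypothetical.

```python
# Hypothetical sketch: the weighting scheme and all names are assumptions,
# not taken from the paper.

def distance_weights(seq_len, progress, max_boost=4.0):
    """Return a seq_len x seq_len matrix of per-pair weights.

    The weight of a (query i, key j) pair is 1 at the start of training
    (progress = 0) and grows linearly with |i - j| as progress approaches 1,
    so longer position differences are emphasized later in training.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    span = max(seq_len - 1, 1)  # avoid division by zero for length-1 inputs
    return [
        [1.0 + progress * max_boost * abs(i - j) / span for j in range(seq_len)]
        for i in range(seq_len)
    ]


def distance_weighted_attention_loss(teacher, student, progress, max_boost=4.0):
    """Weighted mean squared error between two square attention maps.

    teacher and student are lists of rows holding attention probabilities
    for the same sequence; the result is normalized by the total weight.
    """
    n = len(teacher)
    weights = distance_weights(n, progress, max_boost)
    weighted_error = 0.0
    total_weight = 0.0
    for i in range(n):
        for j in range(n):
            diff = teacher[i][j] - student[i][j]
            weighted_error += weights[i][j] * diff * diff
            total_weight += weights[i][j]
    return weighted_error / total_weight
```

With `progress = 0` every position pair counts equally, which reduces to plain attention matching. As `progress` rises, a student mismatch at a large position difference costs more than the same mismatch between neighboring positions, which is one simple way to read "dynamically emphasizes longer position differences during training".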

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning