InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

Hongyuan Tao; Bencheng Liao; Shaoyu Chen; Haoran Yin; Qian Zhang; Wenyu Liu; Xinggang Wang

arXiv:2512.08829 · cs.CV · April 1, 2026

TL;DR

InfiniteVL introduces a hybrid vision-language model that combines linear and sparse attention mechanisms, achieving high efficiency and scalability for ultra-long multimodal understanding tasks.

Contribution

The paper presents InfiniteVL, a hybrid vision-language model trained with a tailored distillation and fine-tuning strategy. The model enables efficient, unlimited-input vision-language processing with high-frequency visual recall.

Findings

01

InfiniteVL-Base matches the performance of equivalent Transformers with a 1.7x decoding speedup.

02

InfiniteVL-Offline achieves 5x prefill acceleration at 256K context.

03

InfiniteVL-Online maintains 25 FPS for real-time streaming.

Abstract

Vision-Language Models (VLMs) are increasingly tasked with ultra-long multimodal understanding. While linear architectures offer constant computation and memory footprints, they often struggle with high-frequency visual perception compared to standard Transformers. To bridge this gap, we introduce InfiniteVL. We first develop a hybrid base model called InfiniteVL-Base that interleaves a small fraction of Full Attention layers with Gated DeltaNet. Empowered by a tailored distillation and fine-tuning strategy, InfiniteVL-Base matches the fundamental multimodal performance of equivalent Transformers while achieving a 1.7× decoding speedup. However, the quadratic complexity of the retained Full Attention inevitably becomes an efficiency bottleneck when scaling to ultra-long contexts. To break this barrier, we propose a novel Long-Sequence Architectural…
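
The interleaving described in the abstract can be illustrated with a toy layer schedule. This is a minimal sketch, not the authors' implementation: the function names, the 1-in-4 ratio, and the "cached entries" cost model are all assumptions made for illustration, since the abstract does not state the actual layer ratio. It only shows why the few retained Full Attention layers dominate cache growth as the input gets longer, while the linear layers keep a fixed-size state.

```python
def hybrid_layer_schedule(num_layers: int, full_attention_every: int) -> list[str]:
    """Build a per-layer type list for a hypothetical hybrid stack.

    Every `full_attention_every`-th layer is full attention; all other
    layers are linear (labelled "gated_deltanet" here). The ratio is an
    illustrative assumption, not a value taken from the paper.
    """
    if num_layers <= 0 or full_attention_every <= 0:
        raise ValueError("num_layers and full_attention_every must be positive")
    return [
        "full_attention" if (i + 1) % full_attention_every == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]


def cached_entries(schedule: list[str], seq_len: int, state_slots: int = 1) -> int:
    """Toy cache-size count, in abstract 'entries' rather than bytes.

    A full attention layer keeps one key/value entry per token, so its
    cache grows with seq_len. A linear layer keeps a fixed-size recurrent
    state, modelled here as `state_slots` entries regardless of seq_len.
    """
    return sum(
        seq_len if kind == "full_attention" else state_slots
        for kind in schedule
    )


if __name__ == "__main__":
    # Hypothetical 8-layer stack with one full attention layer in every four.
    schedule = hybrid_layer_schedule(num_layers=8, full_attention_every=4)
    print(schedule)
    for seq_len in (1_000, 256_000):
        print(seq_len, cached_entries(schedule, seq_len))
```

In this toy model, doubling the sequence length roughly doubles the cache, and all of that growth comes from the full attention layers. That matches the abstract's point that the retained Full Attention becomes the bottleneck at ultra-long context.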

Code & Models

Repositories

hustvl/InfiniteVL (GitHub)
