BlindSight: Harnessing Sparsity for Efficient Vision-Language Models

Tharun Adithya Srikrishnan; Deval Shah; Timothy Hein; Ahmed Hasssan; Stephen Youn; Steven K. Reinhardt

arXiv:2507.09071·cs.CV·February 2, 2026

TL;DR

BlindSight leverages inherent sparsity in vision-language models' attention mechanisms to significantly speed up multi-image inference without substantial accuracy loss, enabling more efficient processing of large prompts.

Contribution

The paper introduces BlindSight, a novel method that exploits attention sparsity in VLMs to optimize multi-image inference with no runtime overhead.

Findings

01 · Achieves 1.8-3.2x speedup in attention computation.

02 · Maintains 99.22% accuracy on multi-image benchmarks.

03 · Generalizes across multiple VLM architectures.

Abstract

Large vision-language models (VLMs) enable joint processing of text and images. However, incorporating vision data significantly increases the prompt length, resulting in a longer time to first token (TTFT). This bottleneck can be alleviated by leveraging the inherent sparsity in the attention computation. Analyzing these attention patterns in VLMs when processing a series of images, we observe the absence of inter-image attention in a substantial portion of layers. Based on this, we propose BlindSight: an approach to optimize multi-image VLM inference using an input-template-aware attention sparsity mask with no runtime overhead. We utilize a dataset to derive a prompt-agnostic categorization for attention heads: Dense, Sink, Intra-Image, and Intra-Image+Sink. We develop a Triton-based GPU kernel to leverage this sparsity. BlindSight achieves a 1.8-3.2x speedup in the attention…
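The four head categories named in the abstract can be pictured as boolean attention masks over a multi-image prompt. The following is a minimal NumPy sketch, not the paper's Triton kernel. It rests on assumptions the abstract does not state: the function name `build_head_mask`, the `image_id` and `is_sink` inputs, the choice of which tokens count as sinks, and the rule that text queries keep full causal attention are all hypothetical and chosen only for illustration.

```python
import numpy as np

def build_head_mask(image_id, is_sink, category):
    """Return an (n, n) boolean mask; True means query row may attend to key column.

    image_id: per-token image index, -1 for text tokens (hypothetical encoding).
    is_sink:  per-token flag for sink positions (assumed: first token and
              the first token of each image).
    category: "dense", "sink", "intra_image", or "intra_image_sink".
    """
    image_id = np.asarray(image_id)
    is_sink = np.asarray(is_sink, dtype=bool)
    n = image_id.shape[0]

    # Standard causal mask: a query sees itself and earlier tokens.
    causal = np.tril(np.ones((n, n), dtype=bool))

    q_img = image_id[:, None]
    k_img = image_id[None, :]
    # Assumption: image queries see only keys from the same image;
    # text queries keep full causal attention.
    intra = causal & ((q_img == -1) | (q_img == k_img))

    # Assumption: sink heads see sink keys plus the diagonal, so that
    # no row is ever empty.
    sink = causal & (is_sink[None, :] | np.eye(n, dtype=bool))

    if category == "dense":
        return causal
    if category == "sink":
        return sink
    if category == "intra_image":
        return intra
    if category == "intra_image_sink":
        return intra | sink
    raise ValueError(f"unknown head category: {category}")

# Toy prompt: one text token, two 3-token images, two trailing text tokens.
image_id = [-1, 0, 0, 0, 1, 1, 1, -1, -1]
is_sink = [True, True, False, False, True, False, False, False, False]

for category in ("dense", "sink", "intra_image", "intra_image_sink"):
    mask = build_head_mask(image_id, is_sink, category)
    kept = mask.sum() / np.tril(np.ones_like(mask)).sum()
    print(f"{category:18s} keeps {kept:.0%} of causal entries")
```

In this toy layout, tokens of the second image never attend to non-sink tokens of the first image under the sparse categories, which is the "absence of inter-image attention" the abstract describes. Because the mask depends only on the prompt template (where images start and end), it can be built before inference, consistent with the "no runtime overhead" claim.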

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

BlindSight offers significant inference acceleration for multi-image vision-language models by exploiting attention sparsity without requiring extra training or changes to model architecture. It maintains almost the same accuracy as dense attention, showing an average accuracy degradation of only about 1.15% across major benchmarks.

Weaknesses

BlindSight relies on predefined sparsity patterns, so it may not capture context-dependent attention dynamics that could be important for some prompts or tasks. The minimal accuracy drop is measured only on major benchmarks; specific cases or other domains might experience higher accuracy degradation. Integration requires careful attention boundary detection, and underlying model changes (e.g., image tokenization strategy) may affect its effectiveness or compatibility.

Reviewer 02 · Rating 4 · Confidence 2

Strengths

1. The paper's categorization of specific sparsity patterns (Intra-Image, Sink) in multi-image VLMs and linking this sparsity to modality boundary tokens (e.g., <image_start>) is a valuable insight.
2. BlindSight is a training-free method, which means it can be readily applied to existing pre-trained models without costly retraining, making it highly practical.
3. The paper goes beyond theoretical analysis by developing a custom Triton GPU kernel, demonstrating a clear path to translating this

Weaknesses

1. The paper's claim of an "average accuracy degradation of only 1.15%" is misleading. A closer look at Table 1 reveals significant performance drops on certain benchmarks. For example:
   - On Qwen2.5-VL (32B), the MMIU benchmark drops from 44.67 to 41.49 (an absolute drop of 3.18 points, or ~7.1% relative degradation).
   - On Gemma 3 (12B), the MUIRBench benchmark drops from 50.64 to 46.62 (an absolute drop of 4.02 points, or ~7.9% relative degradation).
   These are substantial performance hits that cann
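The absolute and relative figures quoted by the reviewer can be reproduced from the two score pairs given in the review. The short check below is an illustrative sketch: the helper name `degradation` is hypothetical, and the only inputs are the four scores quoted above.

```python
def degradation(dense_score, sparse_score):
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = dense_score - sparse_score
    relative_pct = 100.0 * absolute / dense_score
    return absolute, relative_pct

# Score pairs exactly as quoted in the review (dense baseline, BlindSight).
cases = [
    ("Qwen2.5-VL (32B), MMIU", 44.67, 41.49),
    ("Gemma 3 (12B), MUIRBench", 50.64, 46.62),
]

for name, dense_score, sparse_score in cases:
    absolute, relative_pct = degradation(dense_score, sparse_score)
    print(f"{name}: -{absolute:.2f} points, -{relative_pct:.1f}% relative")
```

The distinction matters when reading the headline number: an average degradation stated in absolute points can look small even when individual benchmarks lose several percent of their baseline score.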

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- This study empirically characterizes four recurring head-level attention patterns across major VLM families such as Qwen and Gemma, providing insights into modality-aware sparsity in multimodal transformers.
- This study presents a Triton-based attention kernel tailored for the proposed method, achieving performance gains in realistic long-context inference tasks.

Weaknesses

- I think this work should be compared with previous vision token pruning and token merging methods for VLMs, such as FastV, LLaVA-PruMerge, DivPrune, and DART, in terms of reducing the computational cost of vision tokens. Currently, only a comparison with the original model as the baseline is provided, and the experimental section therefore feels relatively weak.
  * [FastV] https://arxiv.org/abs/2403.06764
  * [LLaVA-PruMerge] https://arxiv.org/abs/2403.15388
  * [DivPrune] https://arxiv.org/a
