BlindSight: Harnessing Sparsity for Efficient Vision-Language Models
Tharun Adithya Srikrishnan, Deval Shah, Timothy Hein, Ahmed Hassan, Stephen Youn, Steven K. Reinhardt

TL;DR
BlindSight leverages inherent sparsity in vision-language models' attention mechanisms to significantly speed up multi-image inference without substantial accuracy loss, enabling more efficient processing of large prompts.
Contribution
The paper introduces BlindSight, a novel method that exploits attention sparsity in VLMs to optimize multi-image inference with no runtime overhead.
Findings
Achieves 1.8-3.2x speedup in attention computation.
Maintains 99.22% accuracy on multi-image benchmarks.
Generalizes across multiple VLM architectures.
Abstract
Large vision-language models (VLMs) enable joint processing of text and images. However, incorporating vision data significantly increases the prompt length, resulting in a longer time to first token (TTFT). This bottleneck can be alleviated by leveraging the inherent sparsity in the attention computation. Analyzing these attention patterns in VLMs when processing a series of images, we observe the absence of inter-image attention in a substantial portion of layers. Based on this, we propose BlindSight: an approach to optimize multi-image VLM inference using an input-template-aware attention sparsity mask with no runtime overhead. We utilize a dataset to derive a prompt-agnostic categorization for attention heads: Dense, Sink, Intra-Image, and Intra-Image+Sink. We develop a Triton-based GPU kernel to leverage this sparsity. BlindSight achieves a 1.8-3.2x speedup in the attention…
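To make the four head categories named in the abstract concrete, here is a minimal sketch of how a per-head boolean attention mask could be built from an input template. It is not the paper's Triton kernel. The function name `build_mask`, the `image_ids` and `sink_positions` inputs, the choice of sink tokens (the first token and each image-start token), and the rule that text queries stay dense are all assumptions made for illustration.

```python
from typing import List, Optional

# Head categories named in the abstract.
DENSE = "dense"
SINK = "sink"
INTRA_IMAGE = "intra_image"
INTRA_IMAGE_SINK = "intra_image+sink"
KINDS = {DENSE, SINK, INTRA_IMAGE, INTRA_IMAGE_SINK}


def build_mask(
    image_ids: List[Optional[int]],
    sink_positions: List[int],
    kind: str,
) -> List[List[bool]]:
    """Return a causal mask; mask[q][k] is True if query q may attend to key k.

    image_ids[i] is the index of the image that token i belongs to, or None
    for a text token (hypothetical encoding of the input template).
    """
    if kind not in KINDS:
        raise ValueError(f"unknown head category: {kind}")
    n = len(image_ids)
    sinks = set(sink_positions)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: keys at or before the query
            if kind == DENSE or image_ids[q] is None:
                # Assumption: text queries keep full causal attention.
                allowed = True
            else:
                same_image = image_ids[k] == image_ids[q]
                is_sink = k in sinks
                if kind == SINK:
                    # Diagonal kept so no row is empty (assumption).
                    allowed = is_sink or k == q
                elif kind == INTRA_IMAGE:
                    allowed = same_image
                else:  # INTRA_IMAGE_SINK
                    allowed = same_image or is_sink
            mask[q][k] = allowed
    return mask


def kept_entries(mask: List[List[bool]]) -> int:
    """Count the query-key pairs that would still be computed."""
    return sum(sum(row) for row in mask)


# Toy template: [BOS, <img0 start>, img0, img0, <img1 start>, img1, img1, text]
image_ids = [None, 0, 0, 0, 1, 1, 1, None]
sink_positions = [0, 1, 4]  # assumed sinks: BOS and the image-start tokens

dense = build_mask(image_ids, sink_positions, DENSE)
intra = build_mask(image_ids, sink_positions, INTRA_IMAGE)
intra_sink = build_mask(image_ids, sink_positions, INTRA_IMAGE_SINK)
```

Because the mask depends only on the template (where the images and boundary tokens sit), it can be computed once per prompt layout rather than from the attention scores, which is consistent with the abstract's description of an input-template-aware mask with no runtime overhead. Comparing `kept_entries(intra)` with `kept_entries(dense)` shows how much of the causal attention matrix an Intra-Image head would skip in this toy case.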
Peer Reviews
Decision·Submitted to ICLR 2026
BlindSight offers significant inference acceleration for multi-image vision-language models by exploiting attention sparsity without requiring extra training or changes to model architecture. It maintains almost the same accuracy as dense attention, showing an average accuracy degradation of only about 1.15% across major benchmarks.
BlindSight relies on predefined sparsity patterns, so it may not capture context-dependent attention dynamics that could be important for some prompts or tasks. The minimal accuracy drop is measured only on major benchmarks; specific cases or other domains might experience higher accuracy degradation. Integration requires careful attention boundary detection, and underlying model changes (e.g., image tokenization strategy) may affect its effectiveness or compatibility.
1. The paper's categorization of specific sparsity patterns (Intra-Image, Sink) in multi-image VLMs and linking this sparsity to modality boundary tokens (e.g., <image_start>) is a valuable insight.
2. BlindSight is a training-free method, which means it can be readily applied to existing pre-trained models without costly retraining, making it highly practical.
3. The paper goes beyond theoretical analysis by developing a custom Triton GPU kernel, demonstrating a clear path to translating this
1. The paper's claim of an "average accuracy degradation of only 1.15%" is misleading. A closer look at Table 1 reveals significant performance drops on certain benchmarks. For example: On Qwen2.5-VL (32B), the MMIU benchmark drops from 44.67 to 41.49 (an absolute drop of 3.18 points, or ~7.1% relative degradation). On Gemma 3 (12B), the MUIRBench benchmark drops from 50.64 to 46.62 (an absolute drop of 4.02 points, or ~7.9% relative degradation). These are substantial performance hits that cann
- This study empirically characterizes four recurring head-level attention patterns across major VLM families such as Qwen and Gemma, providing insights into modality-aware sparsity in multimodal transformers.
- This study presents a Triton-based attention kernel tailored for the proposed method, achieving performance gains in realistic long-context inference tasks.
- I think this work should be compared with previous vision token pruning and token merging methods for VLMs, such as FastV, LLaVA-PruMerge, DivPrune, and DART, in terms of reducing the computational cost of vision tokens. Currently, only a comparison with the original model as the baseline is provided, and the experimental section therefore feels relatively weak.
  * [FastV] https://arxiv.org/abs/2403.06764
  * [LLaVA-PruMerge] https://arxiv.org/abs/2403.15388
  * [DivPrune] https://arxiv.org/a
Taxonomy
Topics: Semiconductor Lasers and Optical Devices · Optical Coherence Tomography Applications · Optical Network Technologies
