VSPrefill: Vertical-Slash Sparse Attention with Lightweight Indexing for Long-Context Prefilling
Chen Guanzhong

TL;DR
VSPrefill introduces a lightweight, sparse attention mechanism with a vertical-slash pattern that significantly speeds up long-context inference in large language models while maintaining high accuracy.
Contribution
It proposes a novel sparse attention method using vertical-slash patterns and a lightweight indexer, enabling linear complexity without modifying backbone models.
Findings
Achieves 98.35% accuracy of full attention
Provides a 4.95x speedup at 128k context length
Sets a new Pareto frontier in accuracy-efficiency trade-off
Abstract
The quadratic complexity of self-attention during the prefill phase impedes long-context inference in large language models. Existing sparse attention methods face a trade-off among context adaptivity, sampling overhead, and fine-tuning costs. We propose VSPrefill, a mechanism requiring lightweight training that uses the vertical-slash structural pattern in attention distributions. Our compact VSIndexer module predicts context-aware importance scores for vertical columns and slash diagonals from key-value representations augmented with RoPE. This approach constructs sparse masks with linear complexity without modifying the backbone parameters. During inference, an adaptive cumulative-threshold strategy allocates sparsity budgets per layer, while a fused kernel executes attention with on-the-fly index merging. Evaluated on Qwen3-4B-Instruct and LLaMA-3.1-8B-Instruct across the LongBench…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
