VSPrefill: Vertical-Slash Sparse Attention with Lightweight Indexing for Long-Context Prefilling

Chen Guanzhong

arXiv:2603.04460·cs.LG·March 6, 2026

TL;DR

VSPrefill introduces a lightweight, sparse attention mechanism with a vertical-slash pattern that significantly speeds up long-context inference in large language models while maintaining high accuracy.

Contribution

VSPrefill proposes a sparse attention method that pairs the vertical-slash pattern with a lightweight indexer (VSIndexer), constructing sparse masks with linear complexity without modifying the backbone model's parameters.
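To make the vertical-slash pattern concrete, here is a minimal NumPy sketch of the mask shape it implies: a few key columns that every query may attend to (vertical), plus a few diagonals at fixed query-key offsets (slash). The function name, its arguments, and the dense boolean mask are illustrative assumptions and are not taken from the paper. Materializing the full mask costs quadratic memory and is done here only for visualization; an efficient implementation would presumably work with the selected indices directly.

```python
import numpy as np


def vertical_slash_mask(seq_len, vertical_cols, slash_offsets):
    """Build a dense causal vertical-slash mask (illustration only).

    mask[i, j] is True when query i may attend to key j, i.e. when
    j <= i and either j is a selected column (vertical) or the offset
    i - j is a selected diagonal (slash).
    Hypothetical helper: names and signature are assumptions.
    """
    q = np.arange(seq_len)[:, None]  # query positions, shape (n, 1)
    k = np.arange(seq_len)[None, :]  # key positions, shape (1, n)
    causal = k <= q
    vertical = np.isin(k, vertical_cols)   # shape (1, n), broadcast over rows
    slash = np.isin(q - k, slash_offsets)  # shape (n, n)
    return causal & (vertical | slash)


if __name__ == "__main__":
    # Column 0 plus the main diagonal and its first sub-diagonal.
    mask = vertical_slash_mask(6, vertical_cols=[0], slash_offsets=[0, 1])
    print(mask.astype(int))
```

In this toy setting each query row keeps at most three keys regardless of sequence length, which is the intuition behind the linear cost of the pattern.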

Findings

01

Retains 98.35% of full attention's accuracy

02

Provides a 4.95x speedup at 128k context length

03

Sets a new Pareto frontier in accuracy-efficiency trade-off

Abstract

The quadratic complexity of self-attention during the prefill phase impedes long-context inference in large language models. Existing sparse attention methods face a trade-off among context adaptivity, sampling overhead, and fine-tuning costs. We propose VSPrefill, a mechanism requiring lightweight training that uses the vertical-slash structural pattern in attention distributions. Our compact VSIndexer module predicts context-aware importance scores for vertical columns and slash diagonals from key-value representations augmented with RoPE. This approach constructs sparse masks with linear complexity without modifying the backbone parameters. During inference, an adaptive cumulative-threshold strategy allocates sparsity budgets per layer, while a fused kernel executes attention with on-the-fly index merging. Evaluated on Qwen3-4B-Instruct and LLaMA-3.1-8B-Instruct across the LongBench…
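The adaptive cumulative-threshold idea mentioned in the abstract can be sketched as follows, assuming the importance scores are non-negative and are normalized to sum to one before thresholding. The function name, the threshold parameter `tau`, and the normalization are assumptions made for illustration; the excerpt above does not specify them, and this is not the paper's implementation.

```python
import numpy as np


def select_by_cumulative_threshold(scores, tau):
    """Keep the fewest indices whose normalized scores sum to at least tau.

    scores: non-negative importance scores (e.g. for columns or diagonals).
    tau:    cumulative-mass threshold in (0, 1]. Hypothetical parameter.
    Returns the kept indices in ascending order.
    """
    scores = np.asarray(scores, dtype=np.float64)
    probs = scores / scores.sum()              # assumed normalization
    order = np.argsort(-probs, kind="stable")  # highest score first
    cum = np.cumsum(probs[order])
    # First position where the running mass reaches tau, then +1 for a count.
    keep = int(np.searchsorted(cum, tau, side="left")) + 1
    keep = min(keep, len(order))               # guard against rounding at tau = 1
    return np.sort(order[:keep])


if __name__ == "__main__":
    peaked = [5, 1, 3, 1]  # mass concentrated on a few indices
    flat = [1, 1, 1, 1]    # mass spread evenly
    print(select_by_cumulative_threshold(peaked, 0.75))
    print(select_by_cumulative_threshold(flat, 0.70))
```

With a fixed threshold, a peaked score distribution yields a small budget and a flat one yields a larger budget, which is one way a per-layer sparsity allocation could adapt to the context.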

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning