Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs
Ryan Synk, Monte Hoover, John Kirchenbauer, Neel Jain, Alex Stein, Manli Shu, Josue Melendez Sanchez, Ramani Duraiswami, Tom Goldstein

TL;DR
This paper introduces a tunable sparse attention mechanism that enables transformer models to perform inference on extremely long contexts, up to 1 million tokens, using commodity GPUs, with minimal performance loss.
Contribution
It presents a novel sparse attention method that reduces inference costs for long contexts, making large-scale transformer inference feasible on standard hardware.
Findings
Inference on context windows of up to 1M tokens using approximately 16GB of GPU RAM
Attending to less than 2% of tokens retains over 95% of model performance
Significant efficiency gains in long-context transformer inference
Abstract
There is growing demand for performing inference with hundreds of thousands of input tokens on trained transformer models. Inference at this extreme scale demands significant computational resources, hindering the application of transformers at long contexts on commodity (i.e., not data-center-scale) hardware. To address the inference-time costs associated with running self-attention-based transformer language models on long contexts and enable their adoption on widely available hardware, we propose a tunable mechanism that reduces the cost of the forward pass by attending to only the most relevant tokens at every generation step using a top-k selection mechanism. We showcase the efficiency gains afforded by our method by performing inference on context windows up to 1M tokens using approximately 16GB of GPU RAM. Our experiments reveal that models are capable of handling the sparsity…
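The top-k selection described in the abstract can be illustrated with a minimal NumPy sketch of a single generation step. This is not the authors' implementation: the function names `topk_attention_step` and `dense_attention_step`, the single-head layout, and the exact (brute-force) scoring of all cached keys are assumptions made for illustration. The summary does not say how the paper makes the selection itself cheap, so the sketch only shows which tokens end up being attended to.

```python
import numpy as np

def dense_attention_step(q, K, V):
    """Standard softmax attention for one query over all n cached tokens."""
    scores = K @ q / np.sqrt(K.shape[1])        # (n,) scaled dot products
    w = np.exp(scores - scores.max())           # numerically stable softmax
    w /= w.sum()
    return w @ V

def topk_attention_step(q, K, V, k):
    """Attention for one query restricted to the k highest-scoring keys.

    q: (d,) query at the current generation step
    K: (n, d) cached keys, V: (n, d_v) cached values
    Returns the attention output and the indices of the selected tokens.
    Hypothetical helper for illustration only.
    """
    n, d = K.shape
    if k < 1:
        raise ValueError("k must be at least 1")
    k = min(k, n)
    scores = K @ q / np.sqrt(d)                 # brute-force scoring (assumption)
    # argpartition places the k largest scores in the last k slots, unsorted
    idx = np.argpartition(scores, n - k)[n - k:]
    top = scores[idx]
    w = np.exp(top - top.max())                 # softmax over selected keys only
    w /= w.sum()
    return w @ V[idx], idx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 10_000, 64
    K = rng.standard_normal((n, d))
    V = rng.standard_normal((n, d))
    q = rng.standard_normal(d)
    k = int(0.02 * n)                           # attend to 2% of cached tokens
    sparse, idx = topk_attention_step(q, K, V, k)
    dense = dense_attention_step(q, K, V)
    print("tokens attended:", len(idx), "of", n)
    print("max abs difference vs dense:", np.abs(sparse - dense).max())
```

Here `k` plays the role of the tunable knob: setting `k = n` recovers dense attention exactly, while smaller values trade accuracy for a softmax and value aggregation over fewer tokens. Random Gaussian keys are a poor stand-in for a trained model's attention patterns, so the printed difference says nothing about the accuracy figures reported in the paper.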
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Advanced Neural Network Applications
