SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

Jonathan Li; Nasim Farahini; Evgenii Iuliugin; Magnus Vesterlund; Christian Häggström; Guangtao Wang; Shubhangi Upasani; Ayush Sachdeva; Rui Li; Faline Fu; Chen Wu; Ayesha Siddiqua; John Long; Tuowen Zhao; Matheen Musaddiq; Håkan Zeffer; Yun Du; Mingran Wang; Qinghua Li; Bo Li; Urmish Thakker; Raghu Prabhakar

arXiv:2511.03092 · cs.AI · April 10, 2026

TL;DR

This paper introduces SnapStream, a novel KV cache compression technique enabling efficient long sequence decoding on dataflow accelerators with minimal accuracy loss, demonstrated on large language models in production.

Contribution

Develops SnapStream, the first sparse KV attention method deployed in production inference systems with static graphs and continuous batching, improving memory efficiency and scalability.

Findings

01  SnapStream achieves 4x on-chip memory savings.

02  SnapStream introduces minimal accuracy degradation on LongBench-v2, AIME24, and LiveCodeBench.

03  SnapStream is demonstrated in a 16-way tensor-parallel deployment on SambaNova accelerators.

Abstract

The proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+ context length support has resulted in increasing demands for on-chip memory to support large KV caches. Techniques such as StreamingLLM and SnapKV demonstrate how to control KV cache size while maintaining model accuracy. Yet these techniques are not commonly used in industrial deployments built on frameworks like vLLM or SGLang. The reason is twofold: on one hand, the static graphs and continuous batching methodology employed by these frameworks make it difficult to admit modifications to the standard multi-head attention algorithm; on the other hand, the accuracy implications of such techniques on modern instruction-following and reasoning models are not well understood, obscuring the need to implement them. In this paper, we explore these accuracy implications on…
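
The two baseline techniques named in the abstract can be illustrated with a minimal sketch of which cache positions each one keeps. This is not the SnapStream implementation. It is a simplified, single-head illustration written for this summary: the function names, parameters (`num_sinks`, `window`, `obs_window`, `budget`), and the precomputed `scores` list are all hypothetical. The first function follows the StreamingLLM idea of keeping a few initial "sink" tokens plus a recent window. The second follows the SnapKV idea of keeping the prompt positions that received the most attention from a trailing observation window, and it assumes those attention scores have already been aggregated.

```python
def streaming_keep(seq_len, num_sinks, window):
    """StreamingLLM-style retention (simplified): keep the first `num_sinks`
    positions and the most recent `window` positions."""
    sinks = list(range(min(num_sinks, seq_len)))
    # Start the recent window after the sinks so no position is listed twice.
    recent_start = max(len(sinks), seq_len - window)
    return sinks + list(range(recent_start, seq_len))


def snapkv_keep(scores, obs_window, budget):
    """SnapKV-style retention (simplified): always keep the last `obs_window`
    positions, plus the `budget` earlier positions with the highest score.

    `scores[i]` is assumed to be the attention that position i received from
    the observation window, already aggregated (hypothetical input)."""
    seq_len = len(scores)
    prefix_len = max(0, seq_len - obs_window)
    # Rank prefix positions by score, highest first, and keep the top `budget`.
    ranked = sorted(range(prefix_len), key=lambda i: scores[i], reverse=True)
    top = sorted(ranked[:budget])  # restore positional order
    return top + list(range(prefix_len, seq_len))


if __name__ == "__main__":
    print(streaming_keep(seq_len=10, num_sinks=2, window=3))
    print(snapkv_keep([0.1, 0.9, 0.2, 0.8, 0.3, 0.0, 0.0, 0.0],
                      obs_window=3, budget=2))
```

In both cases the retained cache size is bounded by fixed parameters rather than by the sequence length, which is what allows such methods to control KV cache memory.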
