Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

Yiming Bian; Joshua M. Akey

arXiv:2604.20819·cs.LG·April 23, 2026

TL;DR

Stream-CQSA introduces a memory-adaptive scheduling framework that enables exact attention computation over billion-token sequences on limited hardware by decomposing attention into independent, schedulable subproblems.

Contribution

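The paper presents CQS Divide, an operation that decomposes attention into independent subsequence computations, and Stream-CQSA, a scheduling framework built on that decomposition. Together they avoid out-of-memory failures without approximation or hardware assumptions.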

Findings

1. Exact attention over billion-token sequences achieved on a single GPU.

2. Memory scaling is predictable and flexible with the proposed framework.

3. No approximation error introduced in the attention computation.
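
To make the "no approximation error" claim concrete, the sketch below shows one well-known way that attention over a long key/value sequence can be split into chunks and recombined to the same result: each chunk returns a running maximum, a normalizer, and an unnormalized output, and partial results are merged with a log-sum-exp rescaling. This is the standard blockwise softmax merge, not the paper's CQS Divide, whose construction is not given in this summary. The function names and the single-query, pure-Python setup are assumptions made for illustration.

```python
import math

def _scores(q, keys):
    # Scaled dot-product scores of one query against a list of keys.
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

def attend_full(q, keys, values):
    # Reference: softmax-weighted sum over all keys at once.
    scores = _scores(q, keys)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def attend_chunk(q, keys, values):
    # Partial result for one chunk: (max score, normalizer, unnormalized output).
    scores = _scores(q, keys)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    acc = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return m, sum(weights), acc

def merge(a, b):
    # Combine two partial results by rescaling both to a shared maximum.
    m_a, l_a, o_a = a
    m_b, l_b, o_b = b
    m = max(m_a, m_b)
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)
    return m, l_a * s_a + l_b * s_b, [x * s_a + y * s_b for x, y in zip(o_a, o_b)]

def attend_streamed(q, keys, values, chunk_size):
    # Process keys/values chunk by chunk; only one chunk is needed at a time.
    state = None
    for start in range(0, len(keys), chunk_size):
        part = attend_chunk(q, keys[start:start + chunk_size],
                            values[start:start + chunk_size])
        state = part if state is None else merge(state, part)
    _, normalizer, out = state
    return [x / normalizer for x in out]
```

Here `chunk_size` plays the role of a memory budget: smaller chunks lower the peak working set while the merged output matches `attend_full` up to floating-point rounding.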

Abstract

The scalability of long-context large language models is fundamentally limited by the quadratic memory cost of exact self-attention, which often leads to out-of-memory (OOM) failures on modern hardware. Existing methods improve memory efficiency to near-linear complexity, while assuming that the full query, key, and value tensors fit in device memory. In this work, we remove this assumption by introducing CQS Divide, an operation derived from cyclic quorum sets (CQS) theory that decomposes attention into a set of independent subsequence computations whose recomposition yields exactly the same result as full-sequence attention. Exploiting this decomposition, we introduce Stream-CQSA, a memory-adaptive scheduling framework that partitions attention into subproblems that fit within arbitrary memory budgets. This recasts attention from a logically monolithic operation into a collection of…
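
As background on the cyclic quorum sets mentioned in the abstract, the sketch below illustrates the all-pairs property usually associated with them: when a base set whose pairwise differences cover every nonzero residue modulo n is shifted cyclically, every pair of the n block indices lands together in at least one quorum. The base set {0, 1, 3} modulo 7 is a textbook example; the helper names are hypothetical, and how the paper maps quorums to attention subproblems and recomposes them is not described in this summary.

```python
from itertools import combinations

def cyclic_quorums(base, n):
    # Quorum i is the base set shifted by i, modulo n.
    return [sorted((d + shift) % n for d in base) for shift in range(n)]

def covers_all_pairs(quorums, n):
    # True if every unordered pair of indices in range(n) shares some quorum.
    covered = set()
    for quorum in quorums:
        covered.update(combinations(quorum, 2))
    return all(pair in covered for pair in combinations(range(n), 2))

# Seven quorums of three blocks each, built from the base set {0, 1, 3}.
quorums = cyclic_quorums([0, 1, 3], 7)
```

Under this reading, each quorum holds only 3 of the 7 blocks, yet every pair of blocks is co-resident in some quorum, which is the kind of property that lets pairwise work be split into smaller independent pieces. A base set such as {0, 1, 2} does not have it: blocks 0 and 3 never share a quorum.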
