Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

Jungsuk Oh; Hyeseo Jeon; Hyunjune Ji; Kyongmin Kong; Jay-Yoon Lee

arXiv:2605.06105·cs.AI·May 8, 2026

TL;DR

SPEED is a layer-asymmetric KV-visibility policy for decoder-only language models. It materializes prompt-token KV states only in the lower layers, which cuts long-context memory and compute cost while keeping benchmark quality close to the full-depth baseline.

Contribution

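The paper proposes a phase-asymmetric KV-visibility policy: non-anchor prompt tokens get KV states only in the lower layers, while Decode-phase tokens stay full-depth. Prefill tokens are therefore removed from the upper-layer Decode visibility set, which lowers the memory and compute cost of long-context inference while preserving broad benchmark quality.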

Findings

01

SPEED reduces active KV memory by 25% at 128K context.

02

Using only 75% of layers for prefill tokens, SPEED reaches a 51.2 average score on OLMES-style benchmarks, comparable to the full-depth baseline.

03

Long-context inference cost drops because prefill tokens are absent from the upper-layer Decode visibility set, while broad benchmark quality is preserved.
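The 25% figure is consistent with simple KV-cache arithmetic. The sketch below is a back-of-the-envelope check, not the paper's measurement: it assumes the public Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128), a 2-byte cache dtype, and that "75% of layers" means 24 of 32 layers. It ignores the BoS anchor and Decode-phase tokens, which are negligible at this prompt length. The function name `kv_bytes` is hypothetical.

```python
def kv_bytes(num_tokens, num_layers, num_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys and values for num_tokens across num_layers."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_elem


PROMPT_TOKENS = 128 * 1024   # 128K-token prompt
TOTAL_LAYERS = 32            # Llama-3.1-8B depth
PREFILL_LAYERS = 24          # assumption: 75% of 32 layers hold prompt KV

full = kv_bytes(PROMPT_TOKENS, TOTAL_LAYERS)
shallow = kv_bytes(PROMPT_TOKENS, PREFILL_LAYERS)
saving = 1 - shallow / full

print(f"full-depth prompt KV: {full / 2**30:.1f} GiB")
print(f"shallow prompt KV:    {shallow / 2**30:.1f} GiB")
print(f"relative saving:      {saving:.0%}")
```

Under these assumptions the prompt KV cache shrinks from 16 GiB to 12 GiB, since the saving is simply the fraction of layers that no longer store prompt KV states.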

Abstract

Long-context inference in decoder-only language models is costly because long prompts are processed during Prefill, cached at every layer, and repeatedly attended to during autoregressive Decode. We introduce Shallow Prefill, dEEp Decode (SPEED), a phase-asymmetric KV-visibility policy that materializes non-anchor prompt-token KV states only in lower layers while keeping Decode-phase tokens full-depth. Unlike previous approaches that make upper-layer prompt KV states cheaper to store or construct, SPEED removes prefill tokens from the upper-layer Decode visibility set altogether. With a minimal BoS anchor, this simple change preserves broad benchmark quality while reducing long-context cost. In a controlled Llama-3.1-8B instruction-tuning study, SPEED using only 75% of layers for prefill tokens reaches 51.2 average score on OLMES-style benchmarks, compared with 51.4 for the…
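The visibility rule described in the abstract can be illustrated with a small sketch. This is one reading of the policy, not the authors' implementation: it assumes "lower layers" means the first `int(num_layers * prefill_frac)` layers (zero-indexed), that the BoS anchor is the prompt token at position 0, and that anchors stay visible at every layer. The function name and all parameters are hypothetical.

```python
def visible_kv_positions(layer, num_layers, prompt_len, decode_len,
                         prefill_frac=0.75, num_anchors=1):
    """Return the KV positions a Decode-phase query may attend to at `layer`.

    Positions 0..prompt_len-1 are prompt tokens; the rest are decoded tokens.
    """
    cutoff = int(num_layers * prefill_frac)   # first upper layer (assumption)
    total = prompt_len + decode_len

    if layer < cutoff:
        # Lower layers: every prompt and decode token has KV states.
        return list(range(total))

    # Upper layers: only anchor prompt tokens and decode tokens are visible.
    anchors = list(range(min(num_anchors, prompt_len)))
    decoded = list(range(prompt_len, total))
    return anchors + decoded


# Toy example: 32 layers, 5 prompt tokens (position 0 is BoS), 2 decoded tokens.
print(visible_kv_positions(0, 32, prompt_len=5, decode_len=2))
print(visible_kv_positions(24, 32, prompt_len=5, decode_len=2))
```

In this toy setting, layer 0 sees all seven positions, while layer 24 sees only the BoS anchor and the two decoded tokens, so the non-anchor prompt tokens need no KV storage in layers 24 through 31.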
