Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes

Willy Fitra Hendria

arXiv:2605.08913 · cs.LG · May 15, 2026

TL;DR

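This paper reports non-monotonic latency in the Apple MPS backend during transformer decoding: latency spikes by up to 21x within specific decoding-budget intervals and then recovers at neighboring configurations. The behavior interacts strongly with KV caching and challenges the assumption that latency scales smoothly with generated sequence length.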

Contribution

The paper identifies and characterizes non-monotonic latency spikes in the Apple MPS backend, localizes them to the decode phase, and examines how the KV cache interacts with the affected execution regimes.

Findings

01. Latency spikes of up to 21x are observed on MPS during decoding, across GPT-2, BLOOM, and OPT.

02. The anomalies originate primarily in the decode phase rather than in prefill.

03. The KV cache interacts strongly with the pathological execution regimes.
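One common way to attribute latency to prefill or decode is to time a one-token generation and subtract it from the time of a full-budget generation. The sketch below illustrates that idea only; the summary above does not describe the paper's own harness. The names `median_latency`, `split_prefill_decode`, and the `generate(max_new_tokens)` callable are hypothetical. On an asynchronous backend a synchronization hook would be passed as `sync` (for MPS, presumably `torch.mps.synchronize`) so that queued work is included in the measurement.

```python
import statistics
import time
from typing import Callable, Optional, Tuple


def median_latency(
    fn: Callable[[], None],
    repeats: int = 5,
    warmup: int = 1,
    sync: Optional[Callable[[], None]] = None,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the median wall-clock time of fn() over `repeats` timed runs."""
    for _ in range(warmup):
        fn()  # untimed warm-up runs (kernel compilation, cache allocation)
    samples = []
    for _ in range(repeats):
        if sync is not None:
            sync()  # drain pending device work before starting the timer
        start = clock()
        fn()
        if sync is not None:
            sync()  # wait for the device to finish before stopping the timer
        samples.append(clock() - start)
    return statistics.median(samples)


def split_prefill_decode(
    generate: Callable[[int], None],
    budget: int,
    **timing_kwargs,
) -> Tuple[float, float]:
    """Approximate (prefill, decode) seconds for one decoding budget.

    Assumption: generate(1) is dominated by prefill, so the remainder of
    generate(budget) is attributed to the decode steps.
    """
    prefill = median_latency(lambda: generate(1), **timing_kwargs)
    total = median_latency(lambda: generate(budget), **timing_kwargs)
    return prefill, max(total - prefill, 0.0)
```

In practice `generate` would wrap a model call with a fixed prompt, and the pair would be recorded once per decoding budget, with KV caching enabled and disabled, so the two phases can be compared across backends.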

Abstract

Autoregressive inference is typically assumed to scale predictably with decoding length, with latency increasing smoothly as generated sequence length grows. In this work, we identify unexpected non-monotonic latency behavior in the Apple MPS backend, where latency changes abruptly across nearby decoding configurations during transformer decoding. Using multiple model families (GPT-2, BLOOM, and OPT), we observe latency spikes of up to 21x within specific decoding-budget intervals, followed by recovery at neighboring configurations. Controlled experiments show that these anomalies originate primarily during the decode phase rather than prefill, are not explained by memory pressure alone, and remain absent on CPU and NVIDIA CUDA backends under identical conditions. We further show that key-value (KV) cache interacts strongly with these pathological execution regimes: KV caching remains…
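The non-monotonic pattern described in the abstract (a spike within a decoding-budget interval, followed by recovery) can be checked mechanically once per-budget latencies are available. The sketch below is one possible check, not the paper's analysis: it flags a budget when some larger budget finished at least `threshold` times faster, which cannot happen if latency grows smoothly with length. The function name, the threshold, and the sample numbers are all hypothetical; the data is synthetic and not taken from the paper.

```python
from typing import Dict, List, Tuple


def non_monotonic_points(
    latencies: Dict[int, float], threshold: float = 1.5
) -> List[Tuple[int, float]]:
    """Return (budget, ratio) pairs where a larger budget ran much faster.

    `ratio` is the latency at this budget divided by the lowest latency
    observed at any larger budget. Under monotone scaling it stays <= 1.
    """
    flagged = []
    best_later = float("inf")  # lowest latency seen at larger budgets so far
    for budget in sorted(latencies, reverse=True):
        latency = latencies[budget]
        if best_later != float("inf") and latency / best_later >= threshold:
            flagged.append((budget, latency / best_later))
        best_later = min(best_later, latency)
    flagged.reverse()  # report in increasing budget order
    return flagged


# Synthetic measurements in seconds, keyed by decoding budget (illustrative only).
synthetic = {64: 1.0, 128: 2.0, 192: 42.0, 256: 44.0, 320: 5.0, 384: 6.0}
spikes = non_monotonic_points(synthetic)
print([budget for budget, _ in spikes])
```

Comparing against later budgets rather than immediate neighbors keeps the check usable when a spike spans several adjacent configurations, since the points inside the interval are still measured against the recovered latency that follows it.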
