Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
Willy Fitra Hendria

TL;DR
This paper uncovers unexpected non-monotonic latency behavior in Apple MPS during transformer decoding, driven by KV cache interactions and specific execution regimes, challenging assumptions of predictable latency scaling.
Contribution
It identifies and analyzes the causes of non-monotonic latency spikes in Apple MPS, emphasizing the role of KV cache and execution regimes in long-context inference.
Findings
Latency spikes of up to 21x appear on the MPS backend within specific decoding-budget intervals, then recover at neighboring configurations.
Anomalies originate mainly during the decode phase, not prefill.
The KV cache interacts strongly with these pathological execution regimes.
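One way to make "non-monotonic" concrete: if latency grew smoothly with decoding budget, no budget would be much slower than a larger one. The sketch below flags budgets that break this rule by a chosen factor. The function name `find_nonmonotonic_spikes`, the `ratio` threshold, and the sample numbers are illustrative assumptions, not the paper's method or data.

```python
def find_nonmonotonic_spikes(budgets, latencies, ratio=2.0):
    """Flag budgets whose latency is at least `ratio` times the fastest
    latency observed at any LARGER budget (a monotonicity violation).

    Hypothetical helper for illustration; returns (budget, factor) pairs
    in increasing budget order.
    """
    if len(budgets) != len(latencies):
        raise ValueError("budgets and latencies must have the same length")

    pairs = sorted(zip(budgets, latencies))
    spikes = []
    later_min = float("inf")  # fastest latency seen at larger budgets so far

    # Walk from the largest budget down, tracking the suffix minimum.
    for budget, latency in reversed(pairs):
        if later_min > 0 and latency >= ratio * later_min:
            spikes.append((budget, latency / later_min))
        later_min = min(later_min, latency)

    spikes.reverse()
    return spikes


# Synthetic latencies in seconds (made up for this example): a slow interval
# at budgets 192-256 that recovers at 320.
example_budgets = [64, 128, 192, 256, 320]
example_latencies = [1.0, 2.0, 42.0, 40.0, 4.0]
example_spikes = find_nonmonotonic_spikes(example_budgets, example_latencies)
```

Comparing against later budgets rather than immediate neighbors keeps the check usable when a slow regime spans several adjacent budgets, as the reported "decoding-budget intervals" suggest.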
Abstract
Autoregressive inference is typically assumed to scale predictably with decoding length, with latency increasing smoothly as generated sequence length grows. In this work, we identify unexpected non-monotonic latency behavior in the Apple MPS backend, where latency changes abruptly across nearby decoding configurations during transformer decoding. Using multiple model families (GPT-2, BLOOM, and OPT), we observe latency spikes of up to 21x within specific decoding-budget intervals, followed by recovery at neighboring configurations. Controlled experiments show that these anomalies originate primarily during the decode phase rather than prefill, are not explained by memory pressure alone, and remain absent on CPU and NVIDIA CUDA backends under identical conditions. We further show that the key-value (KV) cache interacts strongly with these pathological execution regimes: KV caching remains…
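A sweep over decoding budgets of the kind the abstract describes could be timed with a small harness like the one below. This is a minimal sketch under stated assumptions, not the authors' benchmark: `sweep_decode_latency`, `generate_fn`, and `sync_fn` are hypothetical names, and the harness times whatever callable it is given.

```python
import time
from statistics import median


def sweep_decode_latency(generate_fn, budgets, repeats=3, warmup=1, sync_fn=None):
    """Return {budget: median wall-clock seconds} for generate_fn(budget).

    generate_fn: hypothetical callable that runs one generation with the
                 given decoding budget (number of new tokens).
    sync_fn:     optional callable that blocks until queued device work has
                 finished, so asynchronous backends are timed correctly.
    """
    results = {}
    for budget in budgets:
        # Warm-up runs are executed but not timed.
        for _ in range(warmup):
            generate_fn(budget)

        samples = []
        for _ in range(repeats):
            if sync_fn is not None:
                sync_fn()
            start = time.perf_counter()
            generate_fn(budget)
            if sync_fn is not None:
                sync_fn()
            samples.append(time.perf_counter() - start)

        results[budget] = median(samples)
    return results
```

With PyTorch and Hugging Face Transformers, `generate_fn` might wrap `model.generate(**inputs, max_new_tokens=n, use_cache=True)` and `sync_fn` might be `torch.mps.synchronize`. Running the same sweep with `use_cache=False`, or on another device, would give a rough comparison across the conditions the abstract contrasts.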
