Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability

Xingyu Xie; Zhaochen Yu; Yue Liao; Tao Wang; Kim-Chuan Toh; Shuicheng Yan

arXiv:2603.12038·cs.LG·March 13, 2026

Open Access

TL;DR

This paper introduces Slow-Fast Inference (SFI), a training-free decoding method that accelerates long-context autoregressive decoding by exploiting within-sentence support stability. It reports roughly 1.6x to 14.4x higher decoding throughput at quality comparable to full-KV baselines.

Contribution

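The paper proposes a training-free decoding framework that splits generation into frequent low-cost fast steps, which reuse a compact sparse memory, and occasional dense-attention slow steps, which are triggered near semantic boundaries and refresh that memory. This makes decoding in long-context models more efficient.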

Findings

01. Achieves 1.6x to 14.4x higher decoding throughput.

02. Maintains comparable quality to full-KV baselines.

03. Applicable directly to existing models without retraining.

Abstract

Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense-attention slow steps. Fast steps reuse a compact sparse memory for efficient decoding. Slow steps are triggered near semantic boundaries. At slow steps, the model revisits the broader context and uses the Selector to refresh the selected memory for subsequent fast steps. Across the evaluated context lengths, SFI delivers approximately $1.6\times$ to $14.4\times$ higher decoding throughput while…
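The slow/fast split described in the abstract can be illustrated with a toy single-head decode step. This is a minimal sketch under stated assumptions, not the authors' implementation: the punctuation-based boundary test, the top-k `select_support` stand-in for the Selector, the small window of recent tokens added to fast steps, and all function and parameter names are hypothetical choices made for illustration.

```python
import math

# Assumption: sentence-ending punctuation marks a semantic boundary.
BOUNDARY_TOKENS = {".", "!", "?", "\n"}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, positions):
    """Single-head dot-product attention restricted to `positions`."""
    scores = [sum(q * k for q, k in zip(query, keys[p])) for p in positions]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * values[p][d] for w, p in zip(weights, positions))
           for d in range(dim)]
    return out, weights


def select_support(weights, positions, budget):
    """Hypothetical Selector: keep the `budget` highest-weight positions."""
    ranked = sorted(zip(weights, positions), reverse=True)
    return sorted(p for _, p in ranked[:budget])


def slow_fast_step(query, keys, values, support, prev_token, budget=4, recent=2):
    """Run one decode step; return (output, support, kind)."""
    n = len(keys)
    if support is None or prev_token in BOUNDARY_TOKENS:
        # Slow step: dense attention over the full history, then refresh
        # the compact memory used by the following fast steps.
        positions = list(range(n))
        out, weights = attend(query, keys, values, positions)
        return out, select_support(weights, positions, budget), "slow"
    # Fast step: reuse the stored support plus a few recent tokens
    # (the recent window is an assumption of this sketch).
    window = range(max(0, n - recent), n)
    positions = sorted(set(support) | set(window))
    out, _ = attend(query, keys, values, positions)
    return out, support, "fast"
```

A caller would thread `support` through successive steps, passing `None` on the first step to force an initial slow step. In this sketch a fast step attends over at most `budget + recent` positions regardless of history length, which is the source of the savings the abstract describes; the real method's trigger and selection rules may differ.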

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques