QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization

Danush Khanna; Aditya Kumar Guru; Srivarshinee Sridhar; Zidan Ahmed; Rubhav Bahirwani; Meetu Malhotra; Vinija Jain; Aman Chadha; Amitava Das; Kripabandhu Ghosh

arXiv:2506.22396·cs.CL·June 30, 2025

Open Access

TL;DR

QuickSilver is a modular inference framework for large language models that reduces latency and energy consumption by dynamically halting, skipping, and fusing tokens during decoding without retraining or model modifications.

Contribution

The paper introduces four synergistic mechanisms for runtime inference optimization that operate on frozen models, yielding efficiency gains without retraining.

Findings

01  Up to 39.6% FLOP reduction on GPT-2 and Llama-2

02  Negligible perplexity degradation (<= 0.2)

03  Operates without model retraining or architecture changes

Abstract

Inference accounts for the majority of latency and energy consumption in large language model (LLM) deployments, often exceeding 90% of total cost. While training-time efficiency has seen extensive progress, runtime optimization remains a key bottleneck, particularly under autoregressive decoding. Existing approaches -- such as pruning, quantization, early exits, and speculative decoding -- often require retraining or architectural changes, or they disrupt decoding compatibility. We introduce QuickSilver, a modular, token-level framework that enables semantic adaptivity at inference time without altering model weights or structure. QuickSilver integrates four synergistic mechanisms: (i) Dynamic Token Halting, which halts computation for tokens with converged representations; (ii) KV Cache Skipping, which selectively suppresses memory writes to reduce attention overhead; and (iii) Contextual…
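To make the idea of Dynamic Token Halting concrete, here is a minimal sketch of one way such a rule could look. It is not the paper's implementation: the abstract does not state how convergence is measured, so the L2 drift between consecutive layers, the `threshold` parameter, the toy halving "layer", and all function names below are assumptions chosen purely for illustration.

```python
import math

def l2_change(prev, curr):
    """Euclidean distance between a token's state before and after a layer."""
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))

def run_with_halting(tokens, layers, threshold):
    """Apply layers per token, skipping tokens whose state has stopped moving.

    Assumed rule (hypothetical): a token halts once a single layer changes
    its state by less than `threshold` in L2 norm.
    Returns (final_states, halted_flags, number_of_layer_updates_performed).
    """
    states = [list(t) for t in tokens]
    halted = [False] * len(states)
    updates = 0
    for layer in layers:
        for i, state in enumerate(states):
            if halted[i]:
                continue  # halted tokens keep their last state; no compute spent
            new_state = layer(state)
            updates += 1
            if l2_change(state, new_state) < threshold:
                halted[i] = True
            states[i] = new_state
    return states, halted, updates

# Toy stand-in for a transformer layer: halve every coordinate.
def toy_layer(state):
    return [0.5 * x for x in state]

tokens = [[0.01, 0.0], [8.0, 8.0]]   # one nearly settled token, one still changing
layers = [toy_layer] * 6
states, halted, updates = run_with_halting(tokens, layers, threshold=0.05)
dense_updates = len(tokens) * len(layers)
print(f"{updates} of {dense_updates} token-layer updates performed; halted = {halted}")
```

In this toy run the first token halts after a single layer while the second is processed by all six, so 7 of the 12 dense token-layer updates are performed. A real system would apply the test to hidden states inside a frozen model and would need a threshold tuned against quality metrics such as perplexity.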

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning in Materials Science