LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference
Hossein Rajabzadeh, Maryam Dialameh, Chul B. Park, Il-Min Kim, Hyock Ju Kwon

TL;DR
LoRA-Drop is an inference method for LLMs that accelerates decoding by reusing previous-token hidden states in a fixed subset of intermediate layers and periodically refreshing them with a full forward pass, achieving significant speedups and KV-cache reductions with minimal accuracy loss.
Contribution
This paper proposes LoRA-Drop, a plug-and-play inference framework that improves LLM decoding efficiency without an auxiliary routing network and with minimal accuracy degradation, using a temporal compute schedule and low-rank LoRA corrections.
Findings
Up to 2.6× faster decoding across multiple models.
45-55% reduction in KV-cache size.
Accuracy stays within 0.5 percentage points (pp) of the baseline.
Abstract
Autoregressive large language models (LLMs) are bottlenecked by sequential decoding, where each new token typically requires executing all transformer layers. Existing dynamic-depth and layer-skipping methods reduce this cost, but often rely on auxiliary routing mechanisms or incur accuracy degradation when bypassed layers are left uncompensated. We present LoRA-Drop, a plug-and-play inference framework that accelerates decoding by applying a temporal compute schedule to a fixed subset of intermediate layers: on most decoding steps, selected layers reuse the previous-token hidden state and apply a low-rank LoRA correction, while periodic refresh steps execute the full model to prevent drift. LoRA-Drop requires no routing network, is compatible with standard KV caching, and can reduce KV-cache footprint by skipping KV updates in droppable layers during LoRA steps…
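The temporal compute schedule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `full_layer`, `lora_step`, `decode`, `droppable`, and `refresh_every` are hypothetical, the "layer" is a residual tanh map standing in for a transformer block, and the sketch assumes that a droppable layer reuses its own output from the previous token and adds a rank-r correction B(A h) computed from its current input. KV updates are modeled only as a counter of fully executed layers.

```python
import math
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def full_layer(w, h):
    """Toy stand-in for a transformer block: residual plus tanh(W h)."""
    return [x + math.tanh(y) for x, y in zip(h, matvec(w, h))]

def lora_step(prev_out, a, b, h):
    """Assumed LoRA step: reuse the previous-token output of this layer
    and add a low-rank correction B(A h) from the current input."""
    delta = matvec(b, matvec(a, h))
    return [p + d for p, d in zip(prev_out, delta)]

def decode(inputs, weights, lora, droppable, refresh_every):
    """Run the toy model over a sequence of per-token input vectors.

    Step 0 and every `refresh_every`-th step execute all layers (refresh).
    On other steps, layers in `droppable` take the cheap LoRA path and
    perform no KV update. Returns (outputs, number of KV writes).
    """
    prev_out = {}   # layer index -> that layer's output at the previous token
    kv_writes = 0
    outputs = []
    for step, h in enumerate(inputs):
        refresh = step % refresh_every == 0
        for layer, w in enumerate(weights):
            if layer in droppable and not refresh:
                a, b = lora[layer]
                h = lora_step(prev_out[layer], a, b, h)
            else:
                h = full_layer(w, h)
                kv_writes += 1  # only fully executed layers update the KV cache
            prev_out[layer] = h
        outputs.append(h)
    return outputs, kv_writes

# Illustrative configuration (all values are made up for the example).
random.seed(0)
dim, rank, n_layers, n_steps = 4, 1, 4, 8
rand = lambda rows, cols: [[random.uniform(-0.5, 0.5) for _ in range(cols)]
                           for _ in range(rows)]
weights = [rand(dim, dim) for _ in range(n_layers)]
droppable = {1, 2}  # fixed subset of intermediate layers
lora = {l: (rand(rank, dim), rand(dim, rank)) for l in droppable}
inputs = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n_steps)]

outputs, kv_writes = decode(inputs, weights, lora, droppable, refresh_every=4)
print(kv_writes, "KV writes out of", n_steps * n_layers)  # 20 out of 32
```

In this toy setting, two refresh steps write all four layers and six LoRA steps write only the two non-droppable layers, giving 20 of 32 KV writes. The figure depends entirely on the made-up configuration and should not be read as reproducing the paper's reported cache reductions.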
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Multimodal Machine Learning Applications
