LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference

Hossein Rajabzadeh; Maryam Dialameh; Chul B. Park; Il-Min Kim; Hyock Ju Kwon

arXiv:2601.02569·cs.CL·January 7, 2026

Open Access

TL;DR

LoRA-Drop is an inference method that accelerates LLM decoding: on most steps, a fixed subset of intermediate layers reuses the previous-token hidden state with a low-rank LoRA correction, and periodic refresh steps run the full model. It reports up to 2.6× faster decoding and a 45-55% smaller KV cache while staying within 0.5pp of baseline accuracy.

Contribution

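This paper proposes LoRA-Drop, a plug-and-play inference framework that improves LLM decoding efficiency without an auxiliary routing network. It applies a temporal compute schedule to a fixed subset of intermediate layers and compensates for the skipped computation with low-rank LoRA corrections, which limits the accuracy degradation seen when bypassed layers are left uncompensated.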

Findings

01 Up to 2.6× faster decoding across multiple models.

02 45-55% reduction in KV-cache size.

03 Accuracy stays within 0.5pp of the baseline.

Abstract

Autoregressive large language models (LLMs) are bottlenecked by sequential decoding, where each new token typically requires executing all transformer layers. Existing dynamic-depth and layer-skipping methods reduce this cost, but often rely on auxiliary routing mechanisms or incur accuracy degradation when bypassed layers are left uncompensated. We present LoRA-Drop, a plug-and-play inference framework that accelerates decoding by applying a temporal compute schedule to a fixed subset of intermediate layers: on most decoding steps, selected layers reuse the previous-token hidden state and apply a low-rank LoRA correction, while periodic refresh steps execute the full model to prevent drift. LoRA-Drop requires no routing network, is compatible with standard KV caching, and can reduce KV-cache footprint by skipping KV updates in droppable layers during LoRA steps…
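The temporal compute schedule described in the abstract can be illustrated with a small counting sketch. It is not the authors' implementation: the modulo-based refresh rule, the names `is_refresh_step` and `simulate`, and the toy sizes (16 steps, 8 layers, 4 droppable layers, refresh every 4 steps) are all assumptions chosen for illustration. The sketch only counts how many full layer executions, LoRA corrections, and KV-cache writes such a schedule would produce; it does not model hidden states or accuracy.

```python
from dataclasses import dataclass


@dataclass
class ScheduleStats:
    full_layer_calls: int = 0  # full transformer-layer executions
    lora_calls: int = 0        # cheap low-rank corrections on reused states
    kv_entries: int = 0        # per-layer KV-cache entries written


def is_refresh_step(step: int, refresh_period: int) -> bool:
    # Assumed rule: every refresh_period-th decoding step runs the full model.
    return step % refresh_period == 0


def simulate(num_steps: int, num_layers: int,
             droppable: frozenset, refresh_period: int) -> ScheduleStats:
    stats = ScheduleStats()
    for step in range(num_steps):
        refresh = is_refresh_step(step, refresh_period)
        for layer in range(num_layers):
            if layer in droppable and not refresh:
                # LoRA step: reuse the previous-token hidden state, apply a
                # low-rank correction, and skip the KV update for this layer.
                stats.lora_calls += 1
            else:
                # Refresh step, or a layer outside the droppable subset:
                # run the layer normally and write its KV entry.
                stats.full_layer_calls += 1
                stats.kv_entries += 1
    return stats


# Toy configuration (hypothetical values, not taken from the paper).
baseline = simulate(16, 8, frozenset(), refresh_period=4)
dropped = simulate(16, 8, frozenset({2, 3, 4, 5}), refresh_period=4)
kv_saving = 1 - dropped.kv_entries / baseline.kv_entries
print(dropped, f"KV entries saved: {kv_saving:.1%}")
```

In this toy setting the saving depends only on the fraction of droppable layers and the refresh period; the figures it prints are properties of the toy configuration and should not be read as the paper's reported results.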

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Multimodal Machine Learning Applications