DASH: Input-Aware Dynamic Layer Skipping for Efficient LLM Inference with Markov Decision Policies

Ning Yang; Fangxin Liu; Junjie Wang; Tao Yang; Kan Liu; Haibing Guan; Li Jiang

arXiv:2505.17420·cs.CL·May 26, 2025


TL;DR

DASH is an input-aware, dynamic layer-skipping framework for large language models. It reduces inference cost by modeling token-level skip decisions as a Markov Decision Process, and it uses a lightweight compensation mechanism to limit the accuracy loss caused by skipping.

Contribution

We introduce DASH, a novel adaptive layer-skipping approach for LLMs that models skipping as an MDP and employs a lightweight compensation mechanism to preserve accuracy.

Findings

01 Significant inference speedup on multiple LLM architectures.

02 Maintains competitive performance across NLP benchmarks.

03 Outperforms existing layer-skipping methods.

Abstract

Large language models (LLMs) have achieved remarkable performance across a wide range of NLP tasks. However, their substantial inference cost poses a major barrier to real-world deployment, especially in latency-sensitive scenarios. To address this challenge, we propose DASH, an adaptive layer-skipping framework that dynamically selects computation paths conditioned on input characteristics. We model the skipping process as a Markov Decision Process (MDP), enabling fine-grained token-level decisions based on intermediate representations. To mitigate potential performance degradation caused by skipping, we introduce a lightweight compensation mechanism that injects differential rewards into the decision process. Furthermore, we design an asynchronous execution strategy that overlaps layer computation with policy evaluation to minimize runtime overhead. Experiments on multiple…
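To make the idea of token-level skip decisions more concrete, here is a minimal sketch, not the authors' implementation. It treats the per-layer choice as a sequential decision whose state is the current hidden vector plus the previous action, which is one simple way to obtain the Markov property. Every name in it (`make_layer`, `skip_probability`, `run_token`), the logistic policy, the skip penalty, the threshold, and the `always_run` prefix are assumptions made for illustration. The compensation mechanism and the asynchronous execution strategy described in the abstract are not modeled.

```python
import math
import random


def make_layer(dim, seed):
    """Build a toy residual layer h + tanh(W h) with fixed random weights."""
    rng = random.Random(seed)
    W = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

    def layer(h):
        return [
            h[i] + math.tanh(sum(W[i][j] * h[j] for j in range(dim)))
            for i in range(dim)
        ]

    return layer


def skip_probability(h, prev_skipped, weights, bias):
    """Hypothetical policy: logistic score over the state (h, prev_skipped)."""
    score = sum(w * x for w, x in zip(weights, h)) + bias
    if prev_skipped:
        score -= 1.0  # assumed penalty that discourages long runs of skips
    score = max(-60.0, min(60.0, score))  # clamp to keep math.exp in range
    return 1.0 / (1.0 + math.exp(-score))


def run_token(h, layers, policies, threshold=0.5, always_run=1):
    """Process one token, deciding per layer whether to execute or skip it.

    The first `always_run` layers are always executed (an assumption of this
    sketch). Returns the final hidden vector and the executed layer indices.
    """
    executed = []
    prev_skipped = False
    for idx, (layer, (weights, bias)) in enumerate(zip(layers, policies)):
        skip = (
            idx >= always_run
            and skip_probability(h, prev_skipped, weights, bias) > threshold
        )
        if not skip:
            h = layer(h)
            executed.append(idx)
        prev_skipped = skip
    return h, executed


if __name__ == "__main__":
    layers = [make_layer(4, seed) for seed in range(6)]
    rng = random.Random(0)
    # Untrained, random policy parameters: one (weights, bias) pair per layer.
    policies = [([rng.uniform(-1, 1) for _ in range(4)], 0.0) for _ in layers]
    _, executed = run_token([0.1, -0.2, 0.3, 0.4], layers, policies)
    print("executed layers:", executed)
```

In this toy version the policy parameters are random, so the skip pattern carries no meaning. In a trained system they would be learned against a reward that trades compute savings against output quality.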
