DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment

Sangwoo Kwon; Seong Hoon Seo; Jae W. Lee; Yeonhong Park

arXiv:2508.06041 · cs.LG · December 9, 2025

Open Access

TL;DR

DP-LLM introduces a dynamic layer-wise precision assignment mechanism for on-device large language models, improving the trade-off between performance and latency by adapting to how each layer's sensitivity changes across decoding steps.

Contribution

The paper proposes a dynamic precision assignment method that sets each layer's quantization precision at runtime based on input values, rather than fixing a single mixed-precision configuration in advance.

Findings

01. Outperforms prior methods in performance-latency trade-offs

02. Demonstrates effectiveness across multiple models and benchmarks

03. Achieves better resource utilization through dynamic adaptation

Abstract

How can we effectively handle queries for on-device large language models (LLMs) with varying runtime constraints, such as latency and accuracy? Multi-scale quantization addresses this challenge by enabling memory-efficient runtime model adaptation of LLMs through the overlaying of multiple model variants quantized to different bitwidths. Meanwhile, an important question still remains open-ended: how can models be properly configured to match a target precision or latency? While mixed-precision offers a promising solution, we take this further by leveraging the key observation that the sensitivity of each layer dynamically changes across decoding steps. Building on this insight, we introduce DP-LLM, a novel mechanism that dynamically assigns precision to each layer based on input values. Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a…
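The two ideas named in the abstract, overlaying several bitwidths in one stored model and choosing a precision per layer at each decoding step, can be illustrated with a toy sketch. This is not the authors' implementation: the bit-truncation scheme, the magnitude-based selector `choose_bits`, its `threshold`, and every other name below are assumptions made purely for illustration.

```python
from typing import List

MAX_BITS = 8  # assumed bitwidth of the stored "parent" weight codes


def truncate_code(code: int, bits: int) -> int:
    """Derive a lower-bitwidth code by keeping the `bits` most significant bits."""
    return code >> (MAX_BITS - bits)


def dequantize(code: int, bits: int, lo: float, hi: float) -> float:
    """Map a `bits`-wide code to the centre of its bin in the range [lo, hi]."""
    levels = 1 << bits
    return lo + (code + 0.5) * (hi - lo) / levels


def choose_bits(x: List[float], threshold: float, low: int, high: int) -> int:
    """Hypothetical selector: spend more bits when the input magnitude is large."""
    magnitude = max(abs(v) for v in x)
    return high if magnitude > threshold else low


def layer_forward(codes: List[int], x: List[float], bits: int,
                  lo: float = -1.0, hi: float = 1.0) -> float:
    """Dot product of x with weights decoded at the chosen bitwidth."""
    weights = [dequantize(truncate_code(c, bits), bits, lo, hi) for c in codes]
    return sum(w * v for w, v in zip(weights, x))


# One toy "layer" stored once at 8 bits and evaluated at two precisions.
codes = [0, 255, 128, 64]
x = [0.1, -0.9, 0.3, 0.2]
bits = choose_bits(x, threshold=0.5, low=3, high=6)
y = layer_forward(codes, x, bits)
```

Because every lower-bitwidth code is a prefix of the stored 8-bit code, only one copy of the weights is kept in memory, and the selector can change `bits` from one decoding step to the next without loading a different model.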

Taxonomy

Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Software System Performance and Reliability