Prompt-based Depth Pruning of Large Language Models
Juyun Wee, Minjae Park, Jaeho Lee

TL;DR
This paper introduces PuDDing, a prompt-based dynamic depth pruning method for large language models that selects which transformer blocks to omit based on the input prompt, accelerating inference while achieving better task performance than static pruning.
Contribution
The paper proposes a novel dynamic depth pruning algorithm that adapts transformer block removal to specific inputs, outperforming static pruning methods.
Findings
PuDDing accelerates language model inference by omitting transformer blocks chosen per prompt.
It achieves better task performance than static depth pruning methods.
These results are demonstrated empirically on commonsense reasoning benchmarks.
Abstract
Depth pruning aims to reduce the inference cost of a large language model without any hardware-specific complications, by simply removing several less important transformer blocks. However, our empirical findings suggest that the importance of a transformer block may be highly task-dependent: a block that is crucial for a task can be removed without degrading the accuracy on another task. Based on this observation, we develop a dynamic depth pruning algorithm, coined PuDDing (Prompt-routed Dynamic Depth Pruning), which determines which blocks to omit from the model based on the input prompt. PuDDing operates by training a lightweight router to predict the best omission set among a set of options, where this option set has also been constructed in a data-driven manner. Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference…
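The routing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear scorer standing in for the trained router, the toy additive blocks, and the names `route`, `forward_pruned`, `omission_sets`, and `router_weights` are all hypothetical, and the candidate omission sets are hard-coded here rather than constructed from data.

```python
from typing import Callable, FrozenSet, List, Sequence

Block = Callable[[List[float]], List[float]]


def route(
    prompt_features: Sequence[float],
    router_weights: Sequence[Sequence[float]],
    omission_sets: Sequence[FrozenSet[int]],
) -> FrozenSet[int]:
    """Return the candidate omission set with the highest router score.

    Assumption: the router is a plain linear scorer with one weight row
    per candidate set. A real router would be a small trained model.
    """
    scores = [
        sum(w * x for w, x in zip(row, prompt_features))
        for row in router_weights
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return omission_sets[best]


def forward_pruned(
    hidden: List[float], blocks: Sequence[Block], omitted: FrozenSet[int]
) -> List[float]:
    """Run the block stack in order, skipping every omitted block index."""
    for index, block in enumerate(blocks):
        if index in omitted:
            continue  # depth pruning: this block is bypassed entirely
        hidden = block(hidden)
    return hidden


def make_block(delta: float) -> Block:
    """Toy stand-in for a transformer block: adds a constant to each value."""
    def block(hidden: List[float]) -> List[float]:
        return [h + delta for h in hidden]
    return block


# Toy 4-block model and two hypothetical candidate omission sets.
blocks = [make_block(float(i + 1)) for i in range(4)]
omission_sets = [frozenset({1}), frozenset({2, 3})]
router_weights = [[1.0, 0.0], [0.0, 1.0]]

# The omission set is chosen once per prompt, before running the model.
omitted = route([0.2, 0.9], router_weights, omission_sets)
output = forward_pruned([0.0], blocks, omitted)
```

In this sketch the router runs once per prompt and the selected blocks are skipped for the whole forward pass, so the saving comes from executing fewer blocks rather than from any hardware-specific sparsity support.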
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Pruning · Sparse Evolutionary Training
