PLD+: Accelerating LLM inference by leveraging Language Model Artifacts
Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena

TL;DR
PLD+ introduces novel algorithms that leverage model artifacts to significantly accelerate LLM inference for input-guided tasks without additional tuning or computational resources.
Contribution
The paper presents PLD+, a tuning-free method that exploits inference artifacts to speed up LLMs, outperforming existing approaches on multiple input-guided tasks.
Findings
PLD+ outperforms all tuning-free methods in experiments.
In the greedy setting, PLD+ surpasses EAGLE on four tasks.
PLD+ achieves up to a 2.31x inference speedup.
Abstract
To reduce the latency associated with autoregressive LLM inference, speculative decoding has emerged as a novel decoding paradigm, where future tokens are drafted and verified in parallel. However, the practical deployment of speculative decoding is hindered by its requirements for additional computational resources and fine-tuning, which limits its out-of-the-box usability. To address these challenges, we present PLD+, a suite of novel algorithms developed to accelerate the inference process of LLMs, particularly for input-guided tasks. These tasks, which include code editing, text editing, summarization, etc., often feature outputs with substantial overlap with their inputs, an attribute PLD+ is designed to exploit. PLD+ also leverages the artifacts (attention and hidden states) generated during inference to speed up decoding. We test our approach on five input-guided tasks…
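To make the draft-and-verify idea concrete, here is a minimal sketch of drafting by plain n-gram lookup in the input, followed by greedy verification. It illustrates only the input-overlap intuition described in the abstract; it is not the PLD+ algorithm itself, which additionally uses attention and hidden states in ways this summary does not detail. The function names `draft_from_input` and `accept_prefix`, the parameters `ngram` and `max_draft`, and the integer token IDs are all hypothetical choices made for illustration.

```python
from typing import List


def draft_from_input(tokens: List[int], ngram: int = 3, max_draft: int = 5) -> List[int]:
    """Propose draft tokens by copying what followed an earlier occurrence
    of the trailing n-gram (hypothetical helper, plain n-gram lookup)."""
    if len(tokens) < ngram + 1:
        return []
    suffix = tokens[-ngram:]
    # Scan right to left so the most recent earlier match wins;
    # the starting index excludes the trailing n-gram itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == suffix:
            continuation = tokens[start + ngram:start + ngram + max_draft]
            if continuation:
                return continuation
    return []


def accept_prefix(draft: List[int], target: List[int]) -> List[int]:
    """Greedy verification: keep draft tokens until the first disagreement
    with the tokens the model itself would have chosen (`target`)."""
    accepted = []
    for drafted, chosen in zip(draft, target):
        if drafted != chosen:
            break
        accepted.append(drafted)
    return accepted


# Toy sequence: the trailing [1, 2, 3] also appears at the start,
# so the tokens that followed it there become the draft.
context = [1, 2, 3, 4, 5, 6, 1, 2, 3]
draft = draft_from_input(context, ngram=3, max_draft=3)
# `model_choice` stands in for the model's greedy picks from one parallel pass.
model_choice = [4, 5, 9]
accepted = accept_prefix(draft, model_choice)
```

In a real decoder the `target` tokens would come from a single forward pass of the LLM over the drafted positions, so every accepted token saves one sequential decoding step. When outputs overlap heavily with inputs, as in the editing and summarization tasks named above, such drafts are accepted more often.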
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
