PLD+: Accelerating LLM inference by leveraging Language Model Artifacts

Shwetha Somasundaram; Anirudh Phukan; Apoorv Saxena

arXiv:2412.01447·cs.CL·December 3, 2024

TL;DR

PLD+ is a suite of tuning-free algorithms that use model artifacts (attention and hidden states) to accelerate LLM inference on input-guided tasks, without fine-tuning or additional computational resources.

Contribution

The paper presents PLD+, a tuning-free method that exploits artifacts produced during inference to speed up LLMs, and reports that it outperforms the other tuning-free approaches evaluated on input-guided tasks.

Findings

01. PLD+ outperforms all tuning-free methods in experiments.

02. In the greedy setting, PLD+ surpasses EAGLE on four tasks.

03. Achieves up to 2.31x speedup in inference.

Abstract

To reduce the latency associated with autoregressive LLM inference, speculative decoding has emerged as a novel decoding paradigm, where future tokens are drafted and verified in parallel. However, the practical deployment of speculative decoding is hindered by its requirements for additional computational resources and fine-tuning, which limit its out-of-the-box usability. To address these challenges, we present PLD+, a suite of novel algorithms developed to accelerate the inference process of LLMs, particularly for input-guided tasks. These tasks, which include code editing, text editing, summarization, etc., often feature outputs with substantial overlap with their inputs, an attribute PLD+ is designed to exploit. PLD+ also leverages the artifacts (attention and hidden states) generated during inference to speed up decoding. We test our approach on five input-guided tasks…
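
The draft-and-verify idea in the abstract can be illustrated with a minimal sketch. This is not the PLD+ algorithm itself: it shows only plain n-gram lookup drafting against the input followed by greedy verification, and it does not use attention or hidden states. The function names (`draft_from_input`, `verify_greedy`, `generate`), the parameters `ngram` and `max_draft`, and the toy copy model are all assumptions introduced for illustration.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[List[int]], int]


def draft_from_input(context: Sequence[int], ngram: int = 2, max_draft: int = 5) -> List[int]:
    """Draft tokens by copying what followed an earlier occurrence of the last `ngram` tokens."""
    if len(context) <= ngram:
        return []
    suffix = list(context[-ngram:])
    # Scan earlier positions, most recent first, for the same n-gram.
    for start in range(len(context) - ngram - 1, -1, -1):
        if list(context[start:start + ngram]) == suffix:
            follow = list(context[start + ngram:start + ngram + max_draft])
            if follow:
                return follow
    return []


def verify_greedy(context: Sequence[int], draft: Sequence[int], next_token: NextToken) -> List[int]:
    """Keep the longest draft prefix that matches greedy decoding, then add one model token.

    A real system would score all draft positions in one parallel forward pass;
    this toy version calls `next_token` sequentially to stay self-contained.
    """
    accepted: List[int] = []
    for tok in draft:
        target = next_token(list(context) + accepted)
        if target != tok:
            accepted.append(target)  # first mismatch: take the model's token and stop
            return accepted
        accepted.append(tok)
    # Every draft token matched (or the draft was empty): add the next model token.
    accepted.append(next_token(list(context) + accepted))
    return accepted


def generate(prompt: Sequence[int], next_token: NextToken, max_new_tokens: int) -> Tuple[List[int], int]:
    """Return the generated tokens and the number of draft-and-verify steps used."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(verify_greedy(out, draft_from_input(out), next_token))
        steps += 1
    return out[len(prompt):len(prompt) + max_new_tokens], steps


if __name__ == "__main__":
    prompt = [5, 6, 7, 8, 9, 10]

    # Hypothetical stand-in for an LLM on an input-guided task: it echoes the prompt.
    def toy_next_token(ctx: List[int]) -> int:
        return prompt[(len(ctx) - len(prompt)) % len(prompt)]

    tokens, steps = generate(prompt, toy_next_token, max_new_tokens=6)
    print(tokens, steps)
```

Because verification only accepts tokens that greedy decoding would have produced anyway, the output matches ordinary greedy decoding; any speedup comes from accepting several tokens per verification step when the output overlaps with the input.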

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling