Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt

Zhaozhuo Xu; Zirui Liu; Beidi Chen; Yuxin Tang; Jue Wang; Kaixiong Zhou; Xia Hu; Anshumali Shrivastava

arXiv:2305.11186·cs.CL·October 11, 2023·5 cites

Open Access

TL;DR

This paper proposes a soft prompt learning approach to enhance the performance of compressed LLMs, enabling them to match uncompressed models on benchmarks and transfer prompts across datasets and compression levels.

Contribution

It introduces a novel soft prompt learning method that improves compressed LLM performance and demonstrates prompt transferability across tasks and compression schemes.

Findings

01. Soft prompts significantly boost compressed LLM accuracy.

02. Learned prompts transfer effectively across datasets and compression levels.

03. Compressed models with prompts match uncompressed model performance.

Abstract

While the numerous parameters in Large Language Models (LLMs) contribute to their superior performance, this massive scale makes them inefficient and memory-hungry, and therefore hard to deploy on commodity hardware such as a single GPU. Given the memory and power constraints of such devices, model compression methods are widely employed to reduce both model size and inference latency, essentially trading model quality for improved efficiency. Optimizing this accuracy-efficiency trade-off is therefore crucial for deploying LLMs on commodity hardware. In this paper, we introduce a new perspective to optimize this trade-off by prompting compressed models. Specifically, we first observe that for certain questions, the generation quality of a compressed LLM can be significantly improved by adding carefully designed hard prompts, though this isn't the case for all…
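The soft prompt idea summarized above (learn a few continuous prompt vectors while the compressed model's weights stay frozen) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the paper's implementation: the "model" is a mean-pooled linear classifier rather than an LLM, `quantize` is a plain round-to-nearest scheme standing in for a real compression method, and the names `predict`, `mean_loss`, `tune_prompt`, and `make_demo` are hypothetical.

```python
import math
import random


def quantize(weights, levels=4):
    """Uniform round-to-nearest quantization (a stand-in for real compression)."""
    flat = [w for row in weights for w in row]
    lo, hi = min(flat), max(flat)
    step = (hi - lo) / (levels - 1)
    return [[lo + round((w - lo) / step) * step for w in row] for row in weights]


def predict(W, prompt, tokens):
    """Mean-pool the sequence [prompt; tokens], apply the frozen head W, softmax."""
    seq = prompt + tokens
    dim = len(W[0])
    pooled = [sum(vec[j] for vec in seq) / len(seq) for j in range(dim)]
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in W]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_loss(W, prompt, data):
    """Average cross-entropy of the frozen model over (tokens, label) pairs."""
    return -sum(math.log(predict(W, prompt, x)[y]) for x, y in data) / len(data)


def tune_prompt(W, data, n_prompt=2, lr=1.0, steps=300):
    """Gradient descent on the prompt vectors only; W is never updated."""
    dim, n_cls = len(W[0]), len(W)
    prompt = [[0.0] * dim for _ in range(n_prompt)]
    for _ in range(steps):
        grad = [0.0] * dim  # under mean pooling every prompt row shares this gradient
        for x, y in data:
            probs = predict(W, prompt, x)
            err = [probs[c] - (1.0 if c == y else 0.0) for c in range(n_cls)]
            scale = 1.0 / ((n_prompt + len(x)) * len(data))
            for j in range(dim):
                grad[j] += scale * sum(err[c] * W[c][j] for c in range(n_cls))
        prompt = [[p - lr * g for p, g in zip(row, grad)] for row in prompt]
    return prompt


def make_demo(seed=0, n_examples=40, seq_len=3):
    """Toy task: labels come from the full-precision head, inference uses the quantized one."""
    rng = random.Random(seed)
    W_full = [[0.9, -0.4, 0.3, -0.7], [-0.6, 0.8, -0.2, 0.5]]  # hypothetical weights
    data = []
    for _ in range(n_examples):
        tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(seq_len)]
        probs = predict(W_full, [], tokens)
        data.append((tokens, probs.index(max(probs))))
    return quantize(W_full), data
```

Calling `W_q, data = make_demo()` followed by `tune_prompt(W_q, data)` returns prompt vectors whose loss, measured with `mean_loss`, can be compared against the zero-initialized prompt. The point of the design is only that the optimizer touches the prompt and never the quantized weights; in this toy the prompt acts as a learned shift of the pooled input, which is far simpler than prompting a real transformer.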

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Pruning