SelfCP: Compressing Over-Limit Prompt via the Frozen Large Language Model Itself

Jun Gao; Ziqiang Cao; Wenjie Li

arXiv:2405.17052·cs.CL·September 4, 2024

TL;DR

This paper introduces SelfCP, a method that uses the LLM itself to compress long prompts into dense tokens, reducing memory costs and improving inference speed without retraining the model.

Contribution

SelfCP is a novel prompt compression technique that employs a frozen LLM to generate dense token representations, enabling efficient handling of long prompts.

Findings

01. Reduces memory costs by 12× for long prompts

02. Improves inference throughput while maintaining response quality

03. Effective on both English and Chinese benchmarks

Abstract

Long prompts lead to huge hardware costs when using transformer-based Large Language Models (LLMs). Unfortunately, many tasks, such as summarization, inevitably involve long documents, and the wide application of in-context learning easily makes prompt length explode. This paper proposes a Self-Compressor (SelfCP), which employs the target LLM itself to compress over-limit prompts into dense vectors while keeping the allowed prompts unmodified. The dense vectors are then projected into dense tokens via a learnable connector so that the same LLM can understand them without extra burden. The connector is supervised-tuned under the language modeling objective of the LLM on relatively long texts selected from publicly accessible datasets, including an instruction dataset so that SelfCP responds to various prompts, while the target LLM stays frozen during training. We build the lightweight SelfCP upon 2…
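The data flow described in the abstract can be illustrated with a toy sketch in plain Python. It is not the authors' implementation: `embed`, `compress`, `connect`, and `build_input` are hypothetical names, mean-pooling stands in for the frozen LLM acting as compressor, and the choice of which tokens count as "allowed" (here, the first `keep`) and the compression `ratio` are assumptions made only for illustration. The one point it mirrors from the abstract is that the allowed part passes through unmodified while the over-limit part becomes a much shorter sequence of dense tokens, with the connector as the only trainable component.

```python
import random


def embed(token_id, dim):
    """Toy stand-in for a frozen embedding: a fixed vector per token id."""
    rng = random.Random(token_id)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def compress(vectors, ratio):
    """Stand-in compressor (assumption): mean-pool every `ratio` vectors into one."""
    pooled = []
    for start in range(0, len(vectors), ratio):
        group = vectors[start:start + ratio]
        dim = len(group[0])
        pooled.append([sum(v[i] for v in group) / len(group) for i in range(dim)])
    return pooled


def connect(vectors, weight):
    """Connector as a linear map; `weight` is the only part that would be trained."""
    return [[sum(w * x for w, x in zip(row, v)) for row in weight] for v in vectors]


def build_input(token_ids, keep, ratio, weight, dim):
    """Split the prompt, leave the allowed part untouched, compress the rest."""
    allowed, over_limit = token_ids[:keep], token_ids[keep:]
    allowed_vecs = [embed(t, dim) for t in allowed]  # unmodified portion
    if not over_limit:
        return allowed_vecs, []
    dense_vecs = compress([embed(t, dim) for t in over_limit], ratio)
    dense_tokens = connect(dense_vecs, weight)  # projected "dense tokens"
    return allowed_vecs, dense_tokens


if __name__ == "__main__":
    dim = 4
    identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    allowed, dense = build_input(list(range(100)), keep=40, ratio=12,
                                 weight=identity, dim=dim)
    print(len(allowed), len(dense))  # 40 5
```

In this toy run, 60 over-limit tokens shrink to 5 dense tokens, so the model would see 45 positions instead of 100. How the dense tokens are ordered relative to the allowed prompt is left open here because the abstract does not specify it.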
