SelfCP: Compressing Over-Limit Prompt via the Frozen Large Language Model Itself
Jun Gao, Ziqiang Cao, Wenjie Li

TL;DR
This paper introduces SelfCP, a method that uses the LLM itself to compress long prompts into dense tokens, reducing memory costs and improving inference speed without retraining the model.
Contribution
SelfCP is a novel prompt compression technique that employs a frozen LLM to generate dense token representations, enabling efficient handling of long prompts.
Findings
Replaces over-limit prompts with dense tokens at a 12× compression ratio, reducing memory costs
Improves inference throughput while maintaining response quality
Effective on both English and Chinese benchmarks
Abstract
Long prompts lead to huge hardware costs when using transformer-based Large Language Models (LLMs). Unfortunately, many tasks, such as summarization, inevitably introduce long documents, and the wide application of in-context learning easily makes the prompt length explode. This paper proposes a Self-Compressor (SelfCP), which employs the target LLM itself to compress over-limit prompts into dense vectors while keeping the allowed prompts unmodified. The dense vectors are then projected into dense tokens via a learnable connector so that the same LLM can understand them without extra burden. The connector is supervised-tuned under the language modeling objective of the LLM on relatively long texts selected from publicly accessible datasets, including an instruction dataset so that SelfCP responds to various prompts, while the target LLM is kept frozen during training. We build the lightweight SelfCP upon 2…
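The data flow described in the abstract can be illustrated with a toy sketch in plain Python. It is not the authors' implementation. The abstract does not say how the frozen LLM produces dense vectors, so mean pooling over fixed-size chunks is used here purely as a placeholder. Every name (`frozen_embed`, `compress`, `connector`, `build_llm_input`), the hidden size, the ordering of dense tokens after the allowed prompt, and the use of 12 as the chunk size are assumptions made for illustration.

```python
import random

DIM = 8  # toy hidden size (assumption, not from the paper)


def frozen_embed(token_id, dim=DIM):
    """Stand-in for the frozen LLM: a fixed pseudo-random vector per token id."""
    rng = random.Random(token_id)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def split_prompt(tokens, window):
    """Keep the first `window` tokens unmodified; the rest is over-limit."""
    return tokens[:window], tokens[window:]


def compress(hidden, ratio):
    """Placeholder compressor: mean-pool every `ratio` vectors into one."""
    dense = []
    for start in range(0, len(hidden), ratio):
        chunk = hidden[start:start + ratio]
        dim = len(chunk[0])
        dense.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return dense


def connector(vectors, weight, bias):
    """Linear projection; in this sketch it is the only trainable component."""
    return [
        [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weight, bias)]
        for v in vectors
    ]


def identity(dim):
    """Identity matrix, used here as an untrained connector weight."""
    return [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]


def build_llm_input(tokens, window, ratio, weight, bias):
    """Return (input vectors for the LLM, number of dense tokens)."""
    allowed, over_limit = split_prompt(tokens, window)
    allowed_emb = [frozen_embed(t) for t in allowed]
    dense_vectors = compress([frozen_embed(t) for t in over_limit], ratio)
    dense_tokens = connector(dense_vectors, weight, bias)
    # Ordering (allowed prompt first, dense tokens after) is an assumption.
    return allowed_emb + dense_tokens, len(dense_tokens)


if __name__ == "__main__":
    inputs, n_dense = build_llm_input(
        list(range(100)), window=40, ratio=12,
        weight=identity(DIM), bias=[0.0] * DIM,
    )
    print(len(inputs), n_dense)  # prints "45 5"
```

In this toy setting a 100-token prompt with a 40-token window leaves 60 over-limit tokens, which become 5 dense tokens, so the model would see 45 positions instead of 100. Only `weight` and `bias` would receive gradients during training; the stand-in LLM stays fixed, mirroring the frozen-backbone setup the abstract describes.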
