ICPC: In-context Prompt Compression with Faster Inference
Ziyang Yu, Yuyu Liu

TL;DR
ICPC is a scalable prompt compression method for LLMs that adaptively reduces prompt length, improving inference speed and maintaining performance across various NLP tasks.
Contribution
ICPC introduces an adaptive prompt compression technique that minimizes information loss and enhances inference efficiency for large language models.
Findings
Effective compression of long prompts across multiple NLP tasks.
Improved inference speed with maintained or enhanced performance.
Scalable approach suitable for different prompt categories.
Abstract
Despite the recent success of Large Language Models (LLMs), feeding them long prompts remains challenging because LLM inputs have a fixed size. Prompt compression is a promising remedy: it removes redundant tokens from the prompt. However, existing works that use an LLM for compression require additional computation resources and incur memory overhead. To address this, we propose ICPC (In-context Prompt Compression), a novel and scalable prompt compression method that adaptively reduces the prompt length. The key idea of ICPC is to use encoders to calculate the probability of each word appearing in the prompt and to compute the information carried by each word through the information function, which reduces information loss during prompt compression and speeds up the compression itself. Empirically, we demonstrate that ICPC can effectively compress long texts of…
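The key idea in the abstract (score each word by the information it carries, then drop the low-information words) can be illustrated with a small sketch. This is not the authors' implementation. It assumes the "information function" is Shannon self-information, I(w) = -log2 p(w), and it substitutes a toy unigram frequency estimate for the encoder-derived probabilities the paper describes. The names `self_information`, `unigram_probs`, `compress_prompt`, and the `keep_ratio` parameter are hypothetical.

```python
import math
from collections import Counter


def self_information(prob):
    """Shannon self-information in bits (assumed form of the 'information function')."""
    return -math.log2(prob)


def unigram_probs(words):
    """Toy stand-in for encoder probabilities: relative frequency within the prompt.

    ICPC is described as using encoders to estimate word probabilities;
    this frequency count is only a placeholder so the sketch runs on its own.
    """
    counts = Counter(w.lower() for w in words)
    total = len(words)
    return [counts[w.lower()] / total for w in words]


def compress_prompt(words, probs, keep_ratio=0.5):
    """Keep the highest-information words, preserving their original order."""
    if len(words) != len(probs):
        raise ValueError("words and probs must have the same length")
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(len(words) * keep_ratio))
    scores = [self_information(p) for p in probs]
    # Rank by descending information; ties go to the earlier word.
    ranked = sorted(range(len(words)), key=lambda i: (-scores[i], i))
    kept = sorted(ranked[:n_keep])
    return [words[i] for i in kept]


if __name__ == "__main__":
    prompt = "the cat sat on the mat near the door".split()
    print(" ".join(compress_prompt(prompt, unigram_probs(prompt), keep_ratio=0.5)))
```

In this toy run the repeated word "the" has the highest probability and therefore the lowest self-information, so it is removed first. With a real encoder the probabilities would depend on context rather than raw frequency, so the surviving words would differ.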
Taxonomy
Topics: Algorithms and Data Compression · Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
