ICPC: In-context Prompt Compression with Faster Inference

Ziyang Yu; Yuyu Liu

arXiv:2501.01625·cs.CL·January 6, 2025

TL;DR

ICPC is a scalable prompt compression method for LLMs that adaptively reduces prompt length, improving inference speed and maintaining performance across various NLP tasks.

Contribution

ICPC introduces an adaptive prompt compression technique that minimizes information loss and enhances inference efficiency for large language models.

Findings

01. Effective compression of long prompts across multiple NLP tasks.

02. Improved inference speed with maintained or enhanced performance.

03. Scalable approach suitable for different prompt categories.

Abstract

Despite the recent success of Large Language Models (LLMs), feeding them long prompts remains challenging because of the fixed size of LLM inputs. Prompt compression is a promising remedy: it removes redundant tokens from the prompt. However, existing methods rely on an LLM to perform the compression, which requires additional computation resources and incurs memory overhead. To address this, we propose ICPC (In-context Prompt Compression), a novel and scalable prompt compression method that adaptively reduces the prompt length. The key idea of ICPC is to use encoders to calculate the probability of each word appearing in the prompt and to calculate the information carried by each word through the information function, which effectively reduces information loss during prompt compression and increases the speed of compression. Empirically, we demonstrate that ICPC can effectively compress long texts of…
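The key idea in the abstract (score each word by the information it carries, then drop the low-information ones) can be illustrated with a minimal sketch. It assumes the "information function" is the standard self-information I(w) = -log2 p(w), and it takes per-word probabilities as an input rather than computing them with an encoder as the paper describes. The names `self_information`, `compress_prompt`, and `keep_ratio` are hypothetical and do not come from the paper.

```python
import math


def self_information(prob: float) -> float:
    """Self-information in bits: I(w) = -log2 p(w). Rarer words score higher."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def compress_prompt(words, probs, keep_ratio=0.5):
    """Keep the highest-information words, preserving their original order.

    `probs` is assumed to be supplied by some external estimator (the paper
    uses encoders; this sketch does not implement that step).
    """
    if len(words) != len(probs):
        raise ValueError("words and probs must have the same length")
    if not words:
        return []
    n_keep = max(1, math.ceil(len(words) * keep_ratio))
    scores = [self_information(p) for p in probs]
    # Rank positions by descending information; ties go to the earlier word.
    ranked = sorted(range(len(words)), key=lambda i: (-scores[i], i))
    kept_positions = sorted(ranked[:n_keep])
    return [words[i] for i in kept_positions]


if __name__ == "__main__":
    words = ["the", "cat", "sat", "on", "the", "mat"]
    # Illustrative probabilities only, not produced by any real model.
    probs = [0.9, 0.1, 0.2, 0.8, 0.9, 0.05]
    print(" ".join(compress_prompt(words, probs, keep_ratio=0.5)))
```

With these made-up probabilities, frequent function words such as "the" and "on" carry little information and are the first to be removed. Whether ICPC filters at the level of words or larger units, and how it chooses the compression ratio, is not stated in the truncated abstract.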

Taxonomy

Topics: Algorithms and Data Compression · Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings