LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Dongmei Zhang

TL;DR
This paper introduces a data distillation-based prompt compression method that enhances efficiency and faithfulness across various tasks and models by leveraging a token classification approach with a Transformer encoder.
Contribution
It proposes a novel data distillation technique for task-agnostic prompt compression, addressing limitations of entropy-based methods and improving generalization and speed.
Findings
Significant performance improvements over baselines.
Robust generalization across different LLMs.
Compresses prompts 3x-6x faster than existing prompt compression methods, at compression ratios of 2x-5x.
Abstract
This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Considering the redundancy in natural language, existing approaches compress prompts by removing tokens or lexical units according to their information entropy obtained from a causal language model such as LLaMa-7B. The challenge is that information entropy may be a suboptimal compression metric: (i) it only leverages unidirectional context and may fail to capture all essential information needed for prompt compression; (ii) it is not aligned with the prompt compression objective. To address these issues, we propose a data distillation procedure to derive knowledge from an LLM to compress prompts without losing crucial information, and meantime, introduce an extractive text compression dataset. We formulate prompt compression as a token classification problem to guarantee the…
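The token-classification framing in the abstract can be illustrated with a minimal sketch. It assumes each token already has a "keep" probability from some classifier; the function name `compress_tokens`, the `rate` parameter, and the example probabilities are hypothetical and are not taken from the paper or its released code. The sketch keeps the highest-scoring fraction of tokens and emits them in their original order, so the output is an extractive subset of the input.

```python
from typing import List, Sequence


def compress_tokens(
    tokens: Sequence[str], keep_probs: Sequence[float], rate: float
) -> List[str]:
    """Keep roughly `rate` of the tokens, chosen by keep probability.

    Illustrative only: `keep_probs` stands in for the output of a token
    classifier; here the values are supplied by hand.
    """
    if len(tokens) != len(keep_probs):
        raise ValueError("tokens and keep_probs must have the same length")
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")

    # Number of tokens to retain (at least one).
    n_keep = max(1, int(len(tokens) * rate))

    # Rank positions by descending probability; ties go to the earlier token.
    ranked = sorted(range(len(tokens)), key=lambda i: (-keep_probs[i], i))

    # Restore original order so the result is a subsequence of the input.
    kept_positions = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept_positions]


tokens = ["The", "meeting", "will", "start", "at", "9", "am", "tomorrow"]
probs = [0.10, 0.90, 0.20, 0.80, 0.30, 0.95, 0.85, 0.70]  # made-up scores

compressed = compress_tokens(tokens, probs, rate=0.5)
print(" ".join(compressed))  # meeting start 9 am
```

A `rate` of 0.5 corresponds to a 2x compression ratio and 0.2 to 5x. Because tokens are only dropped and never rewritten, every token in the output appears in the original prompt.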
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Algorithms and Data Compression · Computational Physics and Python Applications
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Softmax · Layer Normalization · Multi-Head Attention · Dropout · Residual Connection · Position-Wise Feed-Forward Layer · Byte Pair Encoding
