PoC: Performance-oriented Context Compression for Large Language Models via Performance Prediction
Runsong Zhao, Shilei Liu, Jiwei Tang, Langming Liu, Haibin Chen, Weidong Zhang, Yujin Yuan, Tong Xiao, Jingbo Zhu, Wenbo Su, Bo Zheng

TL;DR
This paper introduces Performance-oriented Context Compression (PoC) for LLMs, which optimizes context reduction based on a specified performance threshold, improving reliability and efficiency over traditional ratio-based methods.
Contribution
The paper proposes a novel performance-aware compression framework with a lightweight predictor, including context-aware variants, to better balance compression and performance in LLM deployment.
Findings
The context-aware predictor reduces prediction error.
PoC achieves better overall performance.
PoC improves the reliability of context compression.
Abstract
While context compression can mitigate the growing inference costs of Large Language Models (LLMs) by shortening contexts, existing methods that specify a target compression ratio or length suffer from unpredictable performance degradation, hindering their reliable deployment. We introduce a paradigm shift to Performance-oriented Context Compression (PoC), where developers specify an acceptable performance floor instead of a compression ratio. PoC employs a lightweight performance predictor to automatically find the most aggressive compression ratio that satisfies this constraint before steering an off-the-shelf compressor. We design and compare two predictor variants: a simple context-agnostic predictor and a more sophisticated context-aware one that considers the input's inherent compressibility. On both question-answering and summarization benchmarks, the context-aware predictor…
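The selection step described in the abstract (find the most aggressive compression setting whose predicted performance stays above the developer's floor) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the keep-rate parameterization (fraction of tokens retained, lower meaning more aggressive), the candidate grid, the function names, and the toy predictor are all assumptions introduced here.

```python
from typing import Callable, Optional, Sequence


def most_aggressive_keep_rate(
    predict: Callable[[float], float],
    floor: float,
    candidates: Sequence[float],
) -> Optional[float]:
    """Return the smallest keep rate whose predicted performance meets the floor.

    `predict` maps a keep rate (fraction of tokens retained) to a predicted
    task score. Returns None if no candidate satisfies the constraint.
    """
    # Scan from the most aggressive setting (smallest keep rate) upward and
    # stop at the first one that the predictor says is acceptable.
    for keep_rate in sorted(candidates):
        if predict(keep_rate) >= floor:
            return keep_rate
    return None


# Hypothetical stand-in for a trained predictor (assumption, not from the
# paper): predicted score grows with the fraction of context retained.
def toy_predictor(keep_rate: float) -> float:
    return 0.9 * keep_rate ** 0.5


GRID = [0.1, 0.2, 0.3, 0.5, 0.75, 1.0]

# Most aggressive setting that keeps the predicted score at or above 0.6.
chosen = most_aggressive_keep_rate(toy_predictor, floor=0.6, candidates=GRID)

# A floor the predictor never reaches yields None, so the caller can fall
# back to the uncompressed context.
unreachable = most_aggressive_keep_rate(toy_predictor, floor=0.95, candidates=GRID)
```

The chosen keep rate would then be handed to whatever off-the-shelf compressor is in use. A linear scan over a small grid avoids assuming that predicted performance is monotonic in the keep rate; if that assumption held, a binary search would be a natural alternative.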
Taxonomy
Topics: Software System Performance and Reliability · Natural Language Processing Techniques · Big Data and Digital Economy
