ProCut: LLM Prompt Compression via Attribution Estimation
Zhentao Xu, Fengyi Li, Albert Chen, Xiaofeng Wang

TL;DR
ProCut is a prompt compression method for large language models that uses attribution analysis to reduce prompt size significantly while maintaining or improving task performance, thereby reducing costs and complexity.
Contribution
ProCut introduces a training-free, attribution-based prompt compression framework that is LLM-agnostic and effective across multiple benchmarks and real-world prompts.
Findings
Achieves 78% token reduction in production prompts.
Maintains or improves task performance (by up to 62%).
Reduces compression latency by over 50%.
Abstract
In large-scale industrial LLM systems, prompt templates often expand to thousands of tokens as teams iteratively incorporate sections such as task instructions, few-shot examples, and heuristic rules to enhance robustness and coverage. This expansion leads to bloated prompts that are difficult to maintain and incur significant inference latency and serving costs. To address this, we introduce Prompt Compression via Attribution Estimation (ProCut), a flexible, LLM-agnostic, training-free framework that compresses prompts through attribution analysis. ProCut segments prompt templates into semantically meaningful units, quantifies their impact on task performance, and prunes low-utility components. Through extensive experiments on five public benchmark datasets and real-world industrial prompts, we show that ProCut achieves substantial prompt size reductions (78% fewer tokens in…
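The segment, attribute, and prune loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes leave-one-out ablation as the attribution estimator, blank-line splitting as the segmentation rule, and a toy scoring function in place of a real LLM evaluation. All function names (`split_into_segments`, `leave_one_out_attribution`, `prune`, `toy_score`) are hypothetical.

```python
from typing import Callable, List


def split_into_segments(prompt: str) -> List[str]:
    """Split a prompt into units on blank lines (a simplifying assumption)."""
    return [part.strip() for part in prompt.split("\n\n") if part.strip()]


def leave_one_out_attribution(
    segments: List[str], score: Callable[[List[str]], float]
) -> List[float]:
    """Attribution of a segment = score drop when that segment is removed."""
    full_score = score(segments)
    return [
        full_score - score(segments[:i] + segments[i + 1:])
        for i in range(len(segments))
    ]


def prune(
    segments: List[str], attributions: List[float], threshold: float = 0.0
) -> List[str]:
    """Keep only segments whose attribution exceeds the threshold."""
    return [seg for seg, attr in zip(segments, attributions) if attr > threshold]


def toy_score(segments: List[str]) -> float:
    """Hypothetical stand-in for task performance measured with an LLM.

    Here only the task instruction and the few-shot example contribute.
    """
    text = "\n\n".join(segments)
    points = 0
    if "Classify" in text:
        points += 3
    if "Example:" in text:
        points += 2
    return float(points)


prompt = (
    "Classify the review as positive or negative.\n\n"
    "Always be polite to the user.\n\n"
    "Example: 'great phone' -> positive\n\n"
    "The company was founded in 1999."
)

segments = split_into_segments(prompt)
attributions = leave_one_out_attribution(segments, toy_score)
compressed = prune(segments, attributions)

print(attributions)
print("\n\n".join(compressed))
```

In practice the scoring function would run the candidate prompt against a labeled evaluation set, which makes each attribution estimate cost one or more LLM calls. Leave-one-out needs one extra evaluation per segment, and that cost is why the choice of estimator matters for compression latency.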
Taxonomy
Topics: Software System Performance and Reliability · Data Quality and Management · Topic Modeling
