Compression Method Matters: Benchmark-Dependent Output Dynamics in LLM Prompt Compression
Warren Johnson

TL;DR
This paper investigates how prompt compression affects output length and inference costs in large language models, revealing benchmark-dependent dynamics and proposing metrics for more reliable evaluation.
Contribution
It introduces the instruction survival probability (Psi) and the Compression Robustness Index (CRI), providing new tools to assess compression effects across different benchmarks.
Findings
Output expansion varies sharply across benchmarks: under r=0.3, DeepSeek expands output 56x on MBPP but only 5x on HumanEval.
Prompt structure, not provider identity, moderates compression effects.
Token savings may overstate actual energy savings.
Abstract
Prompt compression is often evaluated by input-token reduction, but its real deployment impact depends on how compression changes output length and total inference cost. We present a controlled replication and extension study of benchmark-dependent output dynamics under aggressive compression, covering 5,400 API calls across three benchmarks and multiple providers. To explain conflicting prior observations, we formalize instruction survival probability (Psi), a structural metric that captures whether task-critical prompt segments remain after truncation. Results show a strong benchmark effect: under r=0.3, DeepSeek exhibits severe output expansion on MBPP (56x, Psi ≈ 0.15) but substantially lower expansion on HumanEval (5x, Psi ≈ 0.72), while GPT-4o-mini is comparatively stable across benchmarks. This reconciles the apparent discrepancy between previously reported extreme…
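The abstract's central point, that input-token reduction can misstate total inference cost once output length changes, can be illustrated with simple arithmetic. The sketch below is not taken from the paper. It assumes r=0.3 means 30% of input tokens are retained, uses hypothetical token counts and per-token prices, and borrows only the 5x output expansion figure reported for HumanEval. The names CallCost and net_saving are illustrative.

```python
from dataclasses import dataclass


@dataclass
class CallCost:
    """Token counts for one API call (values used below are hypothetical)."""
    input_tokens: int
    output_tokens: int

    def total(self, price_in: float, price_out: float) -> float:
        # Total cost = input tokens * input price + output tokens * output price.
        return self.input_tokens * price_in + self.output_tokens * price_out


def net_saving(baseline: CallCost, compressed: CallCost,
               price_in: float, price_out: float) -> float:
    """Fractional cost change; positive means compression made the call cheaper."""
    base = baseline.total(price_in, price_out)
    return (base - compressed.total(price_in, price_out)) / base


# Hypothetical baseline call: 1000 input tokens, 200 output tokens.
baseline = CallCost(input_tokens=1000, output_tokens=200)

# Assumption: r=0.3 keeps 30% of the input; output expands 5x.
compressed = CallCost(input_tokens=300, output_tokens=200 * 5)

# Hypothetical prices in arbitrary units, with output priced 4x input.
PRICE_IN, PRICE_OUT = 1.0, 4.0

input_saving = 1 - compressed.input_tokens / baseline.input_tokens
cost_saving = net_saving(baseline, compressed, PRICE_IN, PRICE_OUT)

print(f"input-token saving: {input_saving:.0%}")
print(f"net cost saving:    {cost_saving:.0%}")
```

Under these assumed numbers the input shrinks by 70% while the net saving is negative, meaning the compressed call costs more than the baseline. The actual break-even point depends on the provider's real input and output prices, which this sketch does not model.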
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Security and Verification in Computing · Green IT and Sustainability
