Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference

Cornelius Kummer; Lena Jurkschat; Michael Färber; Sahar Vahdati

arXiv:2604.02985 · cs.IR · April 6, 2026

TL;DR

This paper systematically evaluates prompt compression for large language models and shows that it can reduce latency and memory usage without sacrificing output quality, although the gains depend on the hardware and on prompt conditions.

Contribution

It provides the first large-scale analysis of prompt compression trade-offs, including a practical profiler tool for predicting real-world benefits across different setups.

Findings

01

Up to 18% end-to-end speed-ups with prompt compression.

02

Compression can reduce memory usage enough to offload workloads to less powerful GPUs.

03

The effectiveness of compression depends on prompt length, hardware, and compression ratio.

Abstract

With the wide adoption of language models for IR, and specifically RAG systems, the latency of the underlying LLM becomes a crucial bottleneck, since the long contexts of retrieved passages lead to large prompts and therefore to increased compute. Prompt compression, which reduces the size of input prompts while aiming to preserve performance on downstream tasks, has established itself as a cost-effective and low-latency method for accelerating inference in large language models. However, its usefulness depends on whether the additional preprocessing time during generation is offset by faster decoding. We present the first systematic, large-scale study of this trade-off, with thousands of runs and 30,000 queries across several open-source LLMs and three GPU classes. Our evaluation separates compression overhead from decoding latency while tracking output quality and memory usage.…
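The trade-off described in the abstract can be illustrated with a small break-even calculation: compression only helps when the time spent compressing is smaller than the time saved in the LLM call. The sketch below is not the paper's profiler; the function names and the timing values in the usage note are hypothetical and chosen only for illustration.

```python
def end_to_end_speedup(t_compress_s: float,
                       t_llm_original_s: float,
                       t_llm_compressed_s: float) -> float:
    """Relative end-to-end speed-up from compressing the prompt.

    All arguments are wall-clock times in seconds (hypothetical inputs):
      t_compress_s        time spent running the compressor
      t_llm_original_s    LLM latency on the uncompressed prompt
      t_llm_compressed_s  LLM latency on the compressed prompt

    A positive result means compression pays off; a negative result
    means the compression overhead outweighs the faster LLM call.
    """
    if t_llm_original_s <= 0:
        raise ValueError("t_llm_original_s must be positive")
    with_compression = t_compress_s + t_llm_compressed_s
    return (t_llm_original_s - with_compression) / t_llm_original_s


def compression_pays_off(t_compress_s: float,
                         t_llm_original_s: float,
                         t_llm_compressed_s: float) -> bool:
    """True when the compressed pipeline is faster end to end."""
    return end_to_end_speedup(
        t_compress_s, t_llm_original_s, t_llm_compressed_s
    ) > 0.0
```

For example, with made-up timings of 0.2 s for compression, 2.0 s for the original prompt, and 1.5 s for the compressed prompt, `compression_pays_off(0.2, 2.0, 1.5)` is true, while a compressor that takes 0.6 s with the same LLM timings would make the pipeline slower overall.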
