Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference

Weizhi Fei; Xueyan Niu; Guoqing Xie; Yingqing Liu; Bo Bai; Wei Han

arXiv:2501.12959 · cs.CL · February 6, 2025

Open Access

TL;DR

This paper introduces EHPC, a prompt compression technique that uses evaluator heads in transformers to efficiently select key tokens, reducing computational costs and maintaining performance in long-context LLM tasks.

Contribution

The paper presents a novel, training-free prompt compression method leveraging evaluator heads to improve long-context inference efficiency in LLMs.

Findings

01. EHPC achieves state-of-the-art results on prompt compression benchmarks.

02. EHPC significantly reduces the cost and complexity of long-context inference.

03. EHPC performs competitively with key-value cache methods.

Abstract

Although applications involving long-context inputs are crucial for the effective utilization of large language models (LLMs), they also result in increased computational costs and reduced performance. To address this challenge, we propose an efficient, training-free prompt compression method that retains key information within compressed prompts. We identify specific attention heads in transformer-based LLMs, which we designate as evaluator heads, that are capable of selecting tokens in long inputs that are most significant for inference. Building on this discovery, we develop EHPC, an Evaluator Head-based Prompt Compression method, which enables LLMs to rapidly "skim through" input prompts by leveraging only the first few layers with evaluator heads during the pre-filling stage, subsequently passing only the important tokens to the model for inference. EHPC achieves state-of-the-art…
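
The selection step described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names `head_token_scores` and `compress_prompt` are hypothetical, the score is taken to be the attention mass each prompt token receives from the last few positions, and the query/key matrices of the chosen heads are assumed to be already available from the early layers. How the paper identifies evaluator heads and aggregates their scores is not stated in the text above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_token_scores(q, k, n_query):
    """Score every prompt token using one attention head (hypothetical helper).

    q, k: arrays of shape (seq_len, d_head) for a single head.
    The score of a token is the total attention it receives from the
    last `n_query` positions (an assumed scoring rule).
    """
    seq_len, d_head = q.shape
    logits = q[-n_query:] @ k.T / np.sqrt(d_head)          # (n_query, seq_len)
    # Causal mask: a query at position p may only attend to tokens <= p.
    q_pos = np.arange(seq_len - n_query, seq_len)[:, None]
    k_pos = np.arange(seq_len)[None, :]
    logits = np.where(k_pos <= q_pos, logits, -np.inf)
    attn = softmax(logits, axis=-1)
    return attn.sum(axis=0)                                 # (seq_len,)

def compress_prompt(tokens, head_scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving original order.

    head_scores: array of shape (n_heads, seq_len), one row per selected head.
    Scores are summed across heads (an assumed aggregation).
    """
    total = np.sum(head_scores, axis=0)
    top = np.argsort(-total, kind="stable")[:keep]
    kept = np.sort(top)
    return [tokens[i] for i in kept], kept

# Toy usage with hand-written scores for two hypothetical heads.
tokens = ["a", "b", "c", "d", "e"]
scores = np.array([[0.1, 0.9, 0.0, 0.5, 0.2],
                   [0.0, 0.8, 0.1, 0.6, 0.1]])
kept_tokens, kept_idx = compress_prompt(tokens, scores, keep=2)
print(kept_tokens)  # ['b', 'd']
```

In this sketch the compressed prompt is simply the kept tokens in their original order, which would then be passed to the full model; only the scoring pass needs the early layers.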

Taxonomy

Topics: Advanced Data Compression Techniques · Image and Signal Denoising Methods · Fault Detection and Control Systems

Methods: Softmax · Attention Is All You Need