Better Prompt Compression Without Multi-Layer Perceptrons
Edouardo Honig, Andrew Lizarraga, Zijun Frank Zhang, Ying Nian Wu

TL;DR
This paper introduces the Attention-Only Compressor (AOC), a novel prompt compression method that removes MLP layers from Transformer blocks, achieving higher compression ratios and better prompt regeneration than traditional LoRA-based encoders.
Contribution
The paper demonstrates that a prompt compression encoder does not need to keep the inference model's architecture: an attention-only encoder with roughly 67% fewer parameters than the original model still achieves useful compression.
Findings
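AOC is evaluated across compression ratios of up to 480x.
AOC regenerates prompts better than a baseline encoder trained as a LoRA of the inference language model.
Removing the MLP layers yields an encoder with roughly 67% fewer parameters than the original model.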
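To make the 480x figure concrete, here is a minimal sketch of the arithmetic, assuming the compression ratio means the number of prompt tokens divided by the number of compressed tokens. The function names are hypothetical and are not taken from the paper.

```python
import math


def compressed_length(prompt_tokens: int, ratio: float) -> int:
    """Compressed tokens needed for a prompt at a target ratio.

    Assumes ratio = prompt_tokens / compressed_tokens, rounded up so the
    compressed sequence always has at least one token.
    """
    if prompt_tokens <= 0 or ratio <= 0:
        raise ValueError("prompt_tokens and ratio must be positive")
    return max(1, math.ceil(prompt_tokens / ratio))


def compression_ratio(prompt_tokens: int, compressed_tokens: int) -> float:
    """Ratio of original prompt length to compressed length."""
    if prompt_tokens <= 0 or compressed_tokens <= 0:
        raise ValueError("token counts must be positive")
    return prompt_tokens / compressed_tokens


if __name__ == "__main__":
    # Under this definition, a 480-token prompt at 480x fits in one token.
    print(compressed_length(480, 480))
    print(compression_ratio(960, 2))
```

Under this assumed definition, a 480x ratio corresponds to representing 480 prompt tokens with a single compressed token.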
Abstract
Prompt compression is a promising approach to speeding up language model inference without altering the generative model. Prior works compress prompts into smaller sequences of learned tokens using an encoder that is trained as a Low-Rank Adaptation (LoRA) of the inference language model. However, we show that the encoder does not need to keep the original language model's architecture to achieve useful compression. We introduce the Attention-Only Compressor (AOC), which learns a prompt compression encoder after removing the multilayer perceptron (MLP) layers in the Transformer blocks of a language model, resulting in an encoder with roughly 67% fewer parameters compared to the original model. Intriguingly, we find that, across a range of compression ratios up to 480x, AOC can better regenerate prompts and outperform a baseline compression encoder that is a LoRA of the inference language…
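The "roughly 67%" figure is plausible from a simple parameter count. The sketch below assumes a vanilla Transformer block (four d x d attention projections, and a two-layer MLP with a 4x hidden expansion) and ignores biases, normalization layers, and embeddings. The model used in the paper may have a different MLP shape or attention layout, so its exact figure can differ; the function names are hypothetical.

```python
def attention_params(d_model: int) -> int:
    """Weights in the Q, K, V, and output projections (assumed d x d each)."""
    return 4 * d_model * d_model


def mlp_params(d_model: int, expansion: int = 4) -> int:
    """Weights in an assumed two-layer MLP: d -> expansion*d -> d."""
    hidden = expansion * d_model
    return 2 * d_model * hidden


def mlp_fraction(d_model: int, expansion: int = 4) -> float:
    """Share of per-block weights that live in the MLP."""
    attn = attention_params(d_model)
    mlp = mlp_params(d_model, expansion)
    return mlp / (attn + mlp)


if __name__ == "__main__":
    # With a 4x expansion the fraction is 8d^2 / 12d^2 = 2/3 for any d_model.
    print(f"{mlp_fraction(2048):.1%}")
```

Under these assumptions the MLP holds two thirds of each block's weights regardless of width, so dropping it removes about 67% of the block parameters.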
Taxonomy
Methods: Attention Is All You Need · Absolute Position Encodings · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
