A2SF: Accumulative Attention Scoring with Forgetting Factor for Token Pruning in Transformer Decoder
Hyun-rae Jo, Dongkun Shin

TL;DR
This paper introduces A2SF, a novel token pruning method for transformer decoders that uses a forgetting factor to fairly evaluate token importance, improving model accuracy and memory efficiency.
Contribution
A2SF is the first to incorporate a forgetting factor into accumulative attention scoring for decoder token pruning, addressing uneven token importance evaluation.
Findings
A2SF improves LLaMA 2 accuracy by up to 7.8% in 1-shot settings.
A2SF selects tokens more effectively on OPT and LLaMA models.
The method effectively reduces memory usage while maintaining or improving accuracy.
Abstract
Recently, transformer-based large language models (LLMs) have been facing memory bottlenecks caused by the KV cache, especially when handling long sequences. Previous studies proposed KV cache compression techniques that identify insignificant tokens based on Accumulative Attention Scores and remove their entries from the KV cache, noting that only a few tokens play an important role in attention operations. However, we have observed that the existing Accumulative Attention Score is not suitable for the transformer decoder structure. In the decoder model, the number of times the Attention Score accumulates varies depending on the order of token appearance due to the effect of masking, causing an uneven comparison between tokens. To solve this, we propose the Accumulative Attention Score with Forgetting Factor (A2SF) technique, which introduces a Forgetting Factor in the Attention Score accumulation…
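The uneven comparison described in the abstract can be illustrated with a small sketch. Because the abstract is truncated here, the exact A2SF update rule is not shown on this page; the code below assumes the simplest reading, in which the running score is multiplied by a forgetting factor before each new attention row is added. The function names (`causal_attention_rows`, `accumulate`, `top_tokens`) and the toy uniform-attention input are hypothetical and chosen only for illustration.

```python
import math


def causal_attention_rows(scores):
    """Apply a causal mask and a row-wise softmax to a square score matrix.

    Row i only attends to tokens 0..i; masked positions get weight 0.
    """
    rows = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        rows.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return rows


def accumulate(attn_rows, forgetting_factor=1.0):
    """Accumulate attention weights per token over decoding steps.

    ASSUMPTION: the forgetting factor scales the running score before each
    new row is added. forgetting_factor=1.0 gives the plain accumulative score.
    """
    acc = [0.0] * len(attn_rows)
    for row in attn_rows:
        acc = [forgetting_factor * a + w for a, w in zip(acc, row)]
    return acc


def top_tokens(acc, budget):
    """Return the indices of the `budget` highest-scoring tokens, in order."""
    ranked = sorted(range(len(acc)), key=lambda i: acc[i], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    # Toy input: every token is equally "important" (uniform attention).
    n = 4
    attn = causal_attention_rows([[0.0] * n for _ in range(n)])
    plain = accumulate(attn, forgetting_factor=1.0)
    decayed = accumulate(attn, forgetting_factor=0.5)
    print("plain  :", plain)
    print("decayed:", decayed)
    print("kept   :", top_tokens(decayed, budget=2))
```

Even with uniform attention, the plain score favors early tokens simply because they appear in more rows of the masked matrix. Under the assumed update rule, the forgetting factor shrinks that gap between the first and last token, which is the imbalance the abstract says A2SF targets.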
Taxonomy
Topics: Advanced machining processes and optimization · Welding Techniques and Residual Stresses
Methods: Softmax · Attention Is All You Need · OPT · LLaMA
