A2SF: Accumulative Attention Scoring with Forgetting Factor for Token Pruning in Transformer Decoder

Hyun-rae Jo; Dongkun Shin

arXiv:2407.20485·cs.CL·August 1, 2024·1 cites

TL;DR

This paper introduces A2SF, a novel token pruning method for transformer decoders that uses a forgetting factor to fairly evaluate token importance, improving model accuracy and memory efficiency.

Contribution

A2SF is the first to incorporate a forgetting factor into accumulative attention scoring for decoder token pruning, addressing uneven token importance evaluation.

Findings

01. A2SF improves LLaMA 2 accuracy by up to 7.8% in 1-shot settings.

02. A2SF enhances OPT and LLaMA models' token selection effectiveness.

03. The method effectively reduces memory usage while maintaining or improving accuracy.

Abstract

Recently, transformer-based large language models (LLMs) have faced memory bottlenecks caused by the KV cache, especially when handling long sequences. Previous studies proposed KV cache compression techniques that identify insignificant tokens based on Accumulative Attention Scores and remove their entries from the KV cache, noting that only a few tokens play an important role in attention operations. However, we have observed that the existing Accumulative Attention Score is not suitable for the transformer decoder structure. In a decoder model, the number of times the Attention Score accumulates varies with the order of token appearance because of masking, causing an uneven comparison between tokens. To solve this, we propose the Accumulative Attention Score with Forgetting Factor (A2SF) technique, which introduces a Forgetting Factor in the Attention Score accumulation…
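The imbalance described in the abstract can be illustrated with a small sketch. The abstract is truncated before it gives the exact update rule, so the rule used here (multiply the running score by a forgetting factor at every decoding step, then add the new attention row) is an assumption, and the function names `causal_attention`, `accumulate_scores`, and `tokens_to_keep` are hypothetical. The attention weights are random, not taken from any model.

```python
import numpy as np

def causal_attention(n, seed=0):
    """Random causal attention matrix: row i attends only to tokens 0..i."""
    rng = np.random.default_rng(seed)
    logits = rng.normal(size=(n, n))
    mask = np.tril(np.ones((n, n), dtype=bool))
    logits = np.where(mask, logits, -np.inf)   # masked positions get zero weight
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def accumulate_scores(attn, forgetting_factor=1.0):
    """Accumulate per-token attention over decoding steps.

    forgetting_factor=1.0 is the plain accumulative score. Values below 1.0
    decay older contributions (ASSUMED update rule, not confirmed by the
    truncated abstract).
    """
    scores = np.zeros(attn.shape[0])
    for step in range(attn.shape[0]):
        scores = forgetting_factor * scores + attn[step]
    return scores

def tokens_to_keep(scores, budget):
    """Indices of the `budget` highest-scoring tokens, in position order."""
    return np.sort(np.argsort(scores)[-budget:])

attn = causal_attention(8)
plain = accumulate_scores(attn, forgetting_factor=1.0)
decayed = accumulate_scores(attn, forgetting_factor=0.5)
print("kept (plain):  ", tokens_to_keep(plain, 4))
print("kept (decayed):", tokens_to_keep(decayed, 4))
```

With a factor of 1.0, token 0 receives contributions from all eight rows while the last token receives only one, which is the uneven comparison the abstract points to. A factor below 1.0 shrinks older contributions so that early tokens do not win on accumulation count alone. The value 0.5 is arbitrary and chosen only for illustration.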

Taxonomy

Topics: Advanced machining processes and optimization · Welding Techniques and Residual Stresses

Methods: Softmax · Attention Is All You Need · OPT · LLaMA