SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator
Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, Chao Huang

TL;DR
SepLLM is a novel framework that accelerates large language models by compressing segments between separator tokens, significantly reducing computation and memory usage without sacrificing performance.
Contribution
The paper introduces SepLLM, a plug-and-play method that condenses segment information into separator tokens, enabling faster inference and training for large language models.
Findings
Over 50% reduction in KV cache usage on the GSM8K-CoT benchmark.
Maintains performance comparable to standard models.
Processes sequences of 4 million tokens or more effectively.
Abstract
Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference speed, due to their quadratic complexity. In this work, we have identified a key pattern: certain seemingly meaningless separator tokens (i.e., punctuations) contribute disproportionately to attention scores compared to semantically meaningful tokens. This observation suggests that information of the segments between these separator tokens can be effectively condensed into the separator tokens themselves without significant information loss. Guided by this insight, we introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens. Additionally, we implement efficient kernels for…
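The idea in the abstract, keeping separator tokens as stand-ins for the segments before them, can be illustrated with a small sketch of which cache positions would be retained. This is not the authors' implementation. The word-level tokens, the `SEPARATORS` set, the function names, and the choice to also keep a few initial tokens and a short window of recent tokens (`n_initial`, `n_recent`) are all assumptions made for illustration.

```python
# Hypothetical sketch: choose which token positions stay in a KV cache
# when separators are kept as compressed summaries of their segments.

# Assumption: punctuation and newlines are treated as separator tokens.
SEPARATORS = frozenset({".", ",", ";", ":", "!", "?", "\n"})


def kept_positions(tokens, n_initial=1, n_recent=3, separators=SEPARATORS):
    """Return the indices whose key/value entries would be retained.

    A position is kept if it is one of the first `n_initial` tokens,
    one of the last `n_recent` tokens, or a separator token.
    All other positions are treated as redundant and dropped.
    """
    n = len(tokens)
    keep = []
    for i, tok in enumerate(tokens):
        is_initial = i < n_initial
        is_recent = i >= n - n_recent
        is_separator = tok in separators
        if is_initial or is_recent or is_separator:
            keep.append(i)
    return keep


def retention_ratio(tokens, n_initial=1, n_recent=3, separators=SEPARATORS):
    """Fraction of cache entries retained under the rule above."""
    if not tokens:
        return 0.0
    return len(kept_positions(tokens, n_initial, n_recent, separators)) / len(tokens)


if __name__ == "__main__":
    toks = ["The", "cat", "sat", ".", "It", "was", "happy", ",",
            "then", "it", "slept", "."]
    print(kept_positions(toks))   # [0, 3, 7, 9, 10, 11]
    print(retention_ratio(toks))  # 0.5
```

In this toy input, half of the positions are retained. The ratio depends entirely on the text and on the assumed parameters, so it should not be read as a reproduction of the reported results.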
Code & Models
- 🤗 Gausson/pythia-160m-deduped-SepLLM (model, 3 downloads)
- 🤗 Gausson/pythia-160m-deduped-n64-SepLLM (model, 3 downloads)
- 🤗 Gausson/pythia-160m-deduped-n128-SepLLM (model, 4 downloads)
- 🤗 Gausson/pythia-160m-deduped-n64ht-SepLLM (model, 3 downloads)
- 🤗 Gausson/pythia-160m-deduped-n64h-SepLLM (model, 4 downloads)
- 🤗 Gausson/pythia-160m-deduped-n64-RoBiPE-SepLLM (model, 2 downloads)
- 🤗 Gausson/gpt-neox-125m-deduped-SA (model)
- 🤗 Gausson/pythia-160m-deduped-n64-StreamingLLM (model, 3 downloads)
- 🤗 transformers-community/sep_cache (model, 4 downloads, 9 likes)
- 🤗 Gausson/sep_cache (model, 3 downloads, 1 like)
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need
