SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

Guoxuan Chen; Han Shi; Jiawei Li; Yihang Gao; Xiaozhe Ren; Yimeng Chen; Xin Jiang; Zhenguo Li; Weiyang Liu; Chao Huang

arXiv:2412.12094·cs.CL·June 3, 2025

Open Access · 1 Repo · 10 Models · 1 Video

TL;DR

SepLLM is a framework that accelerates large language models by compressing each segment between separator tokens into the separator itself, reducing computation and KV cache usage while keeping performance comparable to standard models.

Contribution

The paper introduces SepLLM, a plug-and-play method that condenses segment information into separator tokens, enabling faster inference and training for large language models.

Findings

01. Over 50% reduction in KV cache on the GSM8K-CoT benchmark.

02. Maintains performance comparable to standard models.

03. Effective in processing sequences of 4 million tokens or more.

Abstract

Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference speed, due to their quadratic complexity. In this work, we have identified a key pattern: certain seemingly meaningless separator tokens (i.e., punctuations) contribute disproportionately to attention scores compared to semantically meaningful tokens. This observation suggests that information of the segments between these separator tokens can be effectively condensed into the separator tokens themselves without significant information loss. Guided by this insight, we introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens. Additionally, we implement efficient kernels for…
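To make the idea in the abstract concrete, here is a minimal sketch of a separator-based sparse causal attention mask. It is a toy illustration, not the authors' kernel implementation. The separator set, the helper names `sep_mask` and `kept_fraction`, and the `n_initial` and `n_local` parameters are all assumptions introduced for this example; in particular, also keeping a few initial tokens and a small local window is an assumption not stated in the abstract above.

```python
# Hypothetical separator set; the abstract only says "punctuations".
SEPARATORS = {".", ",", ";", ":", "!", "?", "\n"}


def sep_mask(tokens, n_initial=1, n_local=2):
    """Build a causal mask where mask[i][j] is True if query i may attend to key j.

    A key position j <= i is kept when it is (a) one of the first
    n_initial tokens, (b) a separator token, or (c) within the last
    n_local positions ending at i (which includes i itself).
    All other past tokens are treated as compressed into separators
    and are masked out.
    """
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only keys at or before the query
            is_initial = j < n_initial
            is_separator = tokens[j] in SEPARATORS
            is_local = i - j < n_local
            mask[i][j] = is_initial or is_separator or is_local
    return mask


def kept_fraction(mask):
    """Fraction of causal (query, key) pairs that the sparse mask retains."""
    n = len(mask)
    kept = sum(sum(row) for row in mask)
    full_causal = n * (n + 1) // 2
    return kept / full_causal


tokens = ["The", "cat", "sat", ".", "It", "ran", ",", "then", "slept", "."]
mask = sep_mask(tokens)
# Key positions the final query still attends to under this toy mask.
visible_to_last = [j for j, keep in enumerate(mask[-1]) if keep]
```

In this sketch the keys visible to the last query are also the only entries that would need to stay in a KV cache at that step, which is the intuition behind the reported cache reduction. The actual savings depend on the separator density of the text and on the window sizes chosen.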

Code & Models

Repositories

HKUDS/SepLLM
PyTorch · Official

Videos

SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need