Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction

Zhenmei Shi; Yifei Ming; Xuan-Phi Nguyen; Yingyu Liang; Shafiq Joty

arXiv:2409.17422·cs.CL·September 27, 2024

TL;DR

This paper introduces GemFilter, a training-free method that accelerates long-context LLM inference by early layer token filtering, reducing input size and memory usage while maintaining performance.

Contribution

The paper presents a novel, training-free token filtering algorithm using early LLM layers to significantly speed up inference and reduce memory consumption for long-context models.

Findings

1. Achieves a 2.4x speedup over SOTA methods

2. Reduces GPU memory usage by 30%

3. Maintains performance on the LongBench challenge

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long context inputs, but this comes at the cost of increased computational resources and latency. Our research introduces a novel approach for the long context bottleneck to accelerate LLM inference and reduce GPU memory consumption. Our research demonstrates that LLMs can identify relevant tokens in the early layers before generating answers to a query. Leveraging this insight, we propose an algorithm that uses early layers of an LLM as filters to select and compress input tokens, significantly reducing the context length for subsequent processing. Our method, GemFilter, demonstrates substantial improvements in both speed and memory efficiency compared to existing techniques, such as standard attention and SnapKV/H2O. Notably, it achieves a 2.4x speedup and 30% reduction in GPU memory usage…
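The filtering idea in the abstract can be illustrated with a toy sketch. It is not taken from the paper or its repository: it assumes that "relevance" is measured by the attention weight the final query token assigns to each input token at one early layer, and that the top-k tokens are kept in their original order. The function names `attention_weights` and `select_tokens`, the two-dimensional vectors, and the value of `k` are all hypothetical.

```python
import math


def attention_weights(query, keys):
    """Softmax over scaled dot products between one query and every key."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(tokens, query, keys, k):
    """Keep the k tokens with the highest attention weight, in original order.

    Assumption: `query` stands in for the last token's query vector at a
    chosen early layer, and `keys` for the per-token key vectors there.
    """
    weights = attention_weights(query, keys)
    top = sorted(range(len(tokens)), key=lambda i: weights[i], reverse=True)[:k]
    keep = sorted(top)  # restore the original token order
    return [tokens[i] for i in keep], keep


# Toy input: seven tokens with made-up 2-d key vectors.
tokens = ["the", "key", "is", "42", "and", "so", "on"]
keys = [[0.1, 0.0], [0.9, 0.8], [0.0, 0.1], [1.0, 1.0],
        [0.1, 0.1], [0.0, 0.0], [0.2, 0.0]]
query = [1.0, 1.0]

kept, indices = select_tokens(tokens, query, keys, k=2)
print(kept, indices)  # ['key', '42'] [1, 3]
```

In a setup like this, only the shortened token list would be passed to the model for generation, which is where savings in latency and memory would come from.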

Code & Models

Repositories

salesforceairesearch/gemfilter (PyTorch, official)
