SinkLoRA: Enhanced Efficiency and Chat Capabilities for Long-Context Large Language Models

Hengyu Zhang

arXiv:2406.05678·cs.CL·June 11, 2024

TL;DR

SinkLoRA improves the efficiency of long-context large language models in two places: a revised sparse attention scheme (SF-Attn) for fine-tuning, and H2O KV cache compression for faster inference on long sequences.

Contribution

The paper introduces SinkLoRA, a method for extending the context length of LLMs that builds on LongLoRA and closes more of the perplexity gap to full attention: 92% of the full-attention improvement, versus 39% for LongLoRA.

Findings

01 Recovers 92% of the perplexity improvement that full attention achieves after fine-tuning.

02 Develops SF-Attn, which uses a segmentation and reassembly algorithm to manage attention heads.

03 Applies H2O KV cache compression to speed up inference.
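
To make the idea behind H2O-style KV cache compression concrete, here is a minimal sketch of heavy-hitter selection: the cache keeps the most recent tokens plus the older tokens that have accumulated the most attention. It is an illustration under stated assumptions, not the paper's or the repository's implementation. The function name `h2o_keep_indices` and the parameters `heavy_budget` and `recent_budget` are hypothetical, and the selection is done in one shot rather than by evicting one token per decoding step.

```python
def h2o_keep_indices(acc_scores, heavy_budget, recent_budget):
    """Pick which cached token positions to keep (hypothetical helper).

    acc_scores    -- accumulated attention score per cached token, oldest first
    heavy_budget  -- how many older "heavy hitter" tokens to retain
    recent_budget -- how many of the most recent tokens to retain
    """
    n = len(acc_scores)
    # The most recent tokens are always kept.
    recent_start = max(0, n - recent_budget)
    recent = list(range(recent_start, n))
    # Among older tokens, keep those with the highest accumulated attention.
    older = range(recent_start)
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)[:heavy_budget]
    # Return positions in their original order so the cache stays ordered.
    return sorted(heavy) + recent


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.5, 0.05, 0.2, 0.3]
    print(h2o_keep_indices(scores, heavy_budget=2, recent_budget=2))  # [0, 2, 4, 5]
```

With a fixed budget, the cache size stays constant as generation proceeds, which is where the inference speedup and memory saving come from.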

Abstract

Extending the functionality of the Transformer model to accommodate longer sequence lengths has become a critical challenge. This extension is crucial not only for improving tasks such as language translation and long-context processing but also for enabling novel applications like chatbots, code generation, and multimedia content creation. The primary obstacle is the self-attention mechanism, which scales quadratically with sequence length in terms of computation time and memory requirements. LongLoRA proposed shifted sparse attention (S²-Attn), effectively enabling context extension and leading to non-trivial computation savings with similar performance to fine-tuning with vanilla attention. However, LongLoRA is still not as efficient as vanilla attention, reaching only 39% of the perplexity improvement compared to full attention. This inefficiency is due to the cyclic shift…
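
The two mechanisms the abstract names, quadratic attention cost and the cyclic shift in S²-Attn, can be illustrated with a small pure-Python sketch. It assumes the LongLoRA formulation in which tokens are split into fixed-size groups and, in half of the attention heads, shifted by half a group before grouping. The function names and the toy sizes are hypothetical, chosen only for illustration.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Number of query-key scores computed by full self-attention."""
    return seq_len * seq_len


def grouped_attention_pairs(seq_len: int, group_size: int) -> int:
    """Number of scores when attention is restricted to fixed-size groups."""
    if seq_len % group_size != 0:
        raise ValueError("seq_len must be a multiple of group_size")
    num_groups = seq_len // group_size
    return num_groups * group_size * group_size


def shifted_groups(seq_len: int, group_size: int) -> list[list[int]]:
    """Token indices per group after a cyclic shift by half a group.

    Position i of the shifted sequence holds original token (i + shift) % seq_len,
    which mirrors rolling the sequence by -shift.
    """
    shift = group_size // 2
    rolled = [(i + shift) % seq_len for i in range(seq_len)]
    return [rolled[g:g + group_size] for g in range(0, seq_len, group_size)]


if __name__ == "__main__":
    print(full_attention_pairs(8192))           # 67108864
    print(grouped_attention_pairs(8192, 2048))  # 16777216
    print(shifted_groups(8, 4))                 # [[2, 3, 4, 5], [6, 7, 0, 1]]
```

In the toy output, the last shifted group contains tokens 6, 7, 0 and 1, so the wrap-around places the end of the sequence in the same attention group as its beginning. This is the side effect of the cyclic shift that the abstract points to.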

Code & Models

Repositories

dexter-gt-86/sinklora (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer