When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models

Haoran You; Yichao Fu; Zheng Wang; Amir Yazdanbakhsh; Yingyan (Celine) Lin

arXiv:2406.07368·cs.CL·July 26, 2024

Open Access · 1 Repo

TL;DR

This paper studies how linear attention can be combined with speculative decoding to make autoregressive large language models both more efficient and more effective, reporting up to a 6.67 reduction in perplexity on LLaMA and up to 2× faster generation.

Contribution

It introduces an augmentation technique for linear attention that keeps it compatible with speculative decoding, validated through experiments and ablation studies on seven linear attention models and five LLMs.

Findings

01 Up to a 6.67 reduction in perplexity on LLaMA

02 Up to a 2× speedup in generation relative to prior linear attention methods

03 Validated across seven linear attention models and five LLMs

Abstract

Autoregressive Large Language Models (LLMs) have achieved impressive performance in language tasks but face two significant bottlenecks: (1) quadratic complexity in the attention module as the number of tokens increases, and (2) limited efficiency due to the sequential processing nature of autoregressive LLMs during generation. While linear attention and speculative decoding offer potential solutions, their applicability and synergistic potential for enhancing autoregressive LLMs remain uncertain. We conduct the first comprehensive study on the efficacy of existing linear attention methods for autoregressive LLMs, integrating them with speculative decoding. We introduce an augmentation technique for linear attention that ensures compatibility with speculative decoding, enabling more efficient training and serving of LLMs. Extensive experiments and ablation studies involving seven…
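To make the first bottleneck in the abstract concrete, here is a minimal NumPy sketch of generic causal linear attention. It is not the paper's augmented method: the `elu(x) + 1` feature map and all function names are illustrative assumptions. It shows that the masked quadratic form and a token-by-token recurrent form with a fixed-size running state give the same outputs, which is what makes linear attention attractive for autoregressive decoding.

```python
import numpy as np

def feature_map(x):
    # Assumed positive feature map elu(x) + 1 (an illustrative choice,
    # not necessarily the one used in the paper).
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention_quadratic(q, k, v):
    """Masked form: builds an (n, n) score matrix, O(n^2) in sequence length."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    scores = np.tril(phi_q @ phi_k.T)  # causal mask: token t sees tokens <= t
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def causal_linear_attention_recurrent(q, k, v):
    """Recurrent form: one fixed-size state update per token, O(n) overall."""
    n, d = q.shape
    d_v = v.shape[1]
    state = np.zeros((d, d_v))  # running sum of outer(phi(k_s), v_s)
    norm = np.zeros(d)          # running sum of phi(k_s)
    out = np.empty((n, d_v))
    for t in range(n):
        phi_k = feature_map(k[t])
        state += np.outer(phi_k, v[t])
        norm += phi_k
        phi_q = feature_map(q[t])
        out[t] = (phi_q @ state) / (phi_q @ norm)
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((6, 4))
k = rng.standard_normal((6, 4))
v = rng.standard_normal((6, 3))
same = np.allclose(
    causal_linear_attention_quadratic(q, k, v),
    causal_linear_attention_recurrent(q, k, v),
)
```

In the recurrent form, each new token needs only `state` and `norm`, so the per-token cost during generation does not grow with the number of tokens already processed.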

Code & Models

Repositories

gatech-eic/linearized-llm
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: LLaMA