InAttention: Linear Context Scaling for Transformers

Joseph Eisner

arXiv:2410.07063·cs.LG·October 10, 2024

TL;DR

This paper introduces InAttention, a replacement for self-attention in decoder-only transformers whose memory use scales linearly with context length during inference. It reduces VRAM usage, makes long sequences practical on consumer GPUs, and lets fine-tuning extend context length at low training cost.

Contribution

The paper proposes InAttention, a modification to the decoder-only transformer architecture in which tokens attend only to initial states. Inference then scales linearly with context length, significantly reducing VRAM requirements.

Findings

01. InAttention reduces VRAM usage during inference.

02. Fine-tuning with InAttention extends context length effectively.

03. InAttention enables handling of longer sequences on consumer GPUs.
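
For a sense of scale behind the VRAM finding, the back-of-envelope below counts the bytes in one attention score matrix. It is an illustration only, not the paper's benchmark: the function name `score_bytes` and the 2-byte (fp16) entry size are assumptions, it covers a single head in a single layer, and it ignores weights, activations, and kernels that avoid materializing the full matrix.

```python
def score_bytes(n_queries, n_keys, bytes_per_entry=2):
    """Bytes needed to hold one (n_queries x n_keys) attention score matrix.

    bytes_per_entry=2 assumes fp16 scores; this is an illustrative choice.
    """
    return n_queries * n_keys * bytes_per_entry


if __name__ == "__main__":
    for n in (1_024, 8_192, 65_536):
        full = score_bytes(n, n)   # every token queries every token: grows with n * n
        single = score_bytes(1, n) # one query row over n positions: grows with n
        print(f"n={n}: full={full} bytes, single query={single} bytes")
```

At 65,536 tokens the full matrix is 8 GiB per head per layer under these assumptions, while a single query row is 128 KiB.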

Abstract

VRAM requirements for transformer models scale quadratically with context length due to the self-attention mechanism. In this paper we modify the decoder-only transformer, replacing self-attention with InAttention, which scales linearly with context length during inference by having tokens attend only to initial states. Benchmarking shows that InAttention significantly reduces VRAM usage during inference, enabling handling of long sequences on consumer GPUs. We corroborate that fine-tuning extends context length efficiently, improving performance on long sequences without high training costs. InAttention offers a scalable solution for long-range dependencies in transformer models, paving the way for further optimization.
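
The abstract's description, "tokens attend only to initial states", can be sketched in NumPy as follows. This is a minimal reading of that sentence, not the author's implementation: the names `in_attention` and `run`, the single head, the square projection matrices, and the omission of residual connections, normalization, and feed-forward blocks are all assumptions made to keep the example short. Queries come from the current hidden states, while keys and values are always projected from the initial states.

```python
import numpy as np


def softmax(scores):
    # Subtract the row max for numerical stability; masked entries are -inf.
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)


def in_attention(hidden, initial, w_q, w_k, w_v):
    """One hypothetical InAttention layer (single head, no residual or MLP).

    hidden:  (m, d) current states for the last m positions
    initial: (n, d) initial states for all n positions
    """
    n, d = initial.shape
    m = hidden.shape[0]
    q = hidden @ w_q                  # queries from the current hidden states
    k = initial @ w_k                 # keys from the initial states only
    v = initial @ w_v                 # values from the initial states only
    scores = q @ k.T / np.sqrt(d)     # (m, n)
    # Causal mask: query row i sits at absolute position n - m + i.
    q_pos = np.arange(n - m, n)[:, None]
    k_pos = np.arange(n)[None, :]
    scores = np.where(k_pos <= q_pos, scores, -np.inf)
    return softmax(scores) @ v


def run(hidden, initial, layers):
    # Every layer attends to the same initial states, never to earlier hidden states.
    for w_q, w_k, w_v in layers:
        hidden = in_attention(hidden, initial, w_q, w_k, w_v)
    return hidden


rng = np.random.default_rng(0)
n, d = 6, 4
x0 = rng.normal(size=(n, d))          # stand-in for the initial token states
layers = [tuple(rng.normal(size=(d, d)) for _ in range(3)) for _ in range(2)]

full = run(x0, x0, layers)            # compute every position: (n, d)
last = run(x0[-1:], x0, layers)       # compute only the newest position: (1, d)
print(np.allclose(full[-1], last[0]))
```

In this sketch the newest token's output depends only on its own hidden state and the initial states, so generating one token needs a single query row per layer rather than hidden states for the whole context. That is the property the linear-scaling claim rests on, under the assumptions stated above.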

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Image Processing and 3D Reconstruction