InAttention: Linear Context Scaling for Transformers
Joseph Eisner

TL;DR
This paper introduces InAttention, a replacement for self-attention in decoder-only transformers whose inference-time VRAM usage scales linearly with context length, making long sequences practical on consumer GPUs and letting fine-tuning extend the usable context length.
Contribution
The paper proposes InAttention, a modification to the decoder-only transformer in which tokens attend only to initial states, so that inference-time VRAM requirements scale linearly with context length rather than quadratically.
Findings
InAttention reduces VRAM usage during inference.
Fine-tuning with InAttention extends context length effectively.
InAttention enables handling of longer sequences on consumer GPUs.
Abstract
VRAM requirements for transformer models scale quadratically with context length due to the self-attention mechanism. In this paper we modify the decoder-only transformer, replacing self-attention with InAttention, which scales linearly with context length during inference by having tokens attend only to initial states. Benchmarking shows that InAttention significantly reduces VRAM usage during inference, enabling handling of long sequences on consumer GPUs. We corroborate that fine-tuning extends context length efficiently, improving performance on long sequences without high training costs. InAttention offers a scalable solution for long-range dependencies in transformer models, paving the way for further optimization.
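The mechanism described in the abstract can be illustrated with a minimal single-head NumPy sketch. It is one reading of "tokens attend only to initial states", not the paper's implementation. It assumes that queries come from the current layer's hidden states while keys and values are projected from the initial (input-level) token states. The function name `in_attention`, the weight names, and the causal-mask convention are all hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def in_attention(hidden, initial, w_q, w_k, w_v):
    """Single-head sketch (assumed form, not taken from the paper).

    hidden:  (t_q, d) current-layer states of the last t_q positions
    initial: (t, d)   initial states of all t positions seen so far
    """
    t_q, d = hidden.shape
    t = initial.shape[0]
    q = hidden @ w_q    # queries from the current layer
    k = initial @ w_k   # keys from initial states only
    v = initial @ w_v   # values from initial states only
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: a query at absolute position p sees keys 0..p.
    q_pos = np.arange(t - t_q, t)[:, None]
    k_pos = np.arange(t)[None, :]
    scores = np.where(k_pos <= q_pos, scores, -np.inf)
    return softmax(scores) @ v


rng = np.random.default_rng(0)
t, d = 6, 8
x0 = rng.normal(size=(t, d))  # initial states (e.g. embeddings)
h = rng.normal(size=(t, d))   # hidden states at some deeper layer
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

full = in_attention(h, x0, w_q, w_k, w_v)       # all positions at once
last = in_attention(h[-1:], x0, w_q, w_k, w_v)  # decode only the newest token
```

Under this assumption, decoding the newest token needs only its own hidden state plus the stored initial states, so the per-token state kept during inference is a single (t, d) array that grows linearly with context length.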
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Image Processing and 3D Reconstruction
