PolySketchFormer: Fast Transformers via Sketching Polynomial Kernels

Praneeth Kacham; Vahab Mirrokni; Peilin Zhong

arXiv:2310.01655 · cs.LG · March 19, 2024 · 2 cites

TL;DR

PolySketchFormer introduces a linear-time Transformer model using polynomial sketching techniques, enabling efficient long-context language modeling with provable guarantees and significant speedups over existing methods.

Contribution

The paper presents a novel polynomial sketching approach for attention, achieving linear-time Transformers without sparsification, and demonstrates its effectiveness on large-scale language modeling tasks.

Findings

01. Achieves 2.5-4x training speedup on long-context models

02. Maintains model quality comparable to standard Transformers

03. Validates approach on synthetic and real-world datasets

Abstract

The quadratic time and memory complexity inherent to self-attention mechanisms, with respect to sequence length, presents a critical computational bottleneck in the training and deployment of large-scale Transformer-based language models. Recent theoretical results indicate the intractability of sub-quadratic softmax attention approximation under reasonable complexity assumptions. This paper addresses this challenge by first demonstrating that polynomial attention with high degree can effectively replace softmax without sacrificing model quality. Next, we develop polynomial sketching techniques from numerical linear algebra to achieve linear-time polynomial attention with approximation guarantees. Crucially, our approach achieves this speedup without requiring the sparsification of attention matrices. We also present a block-based algorithm to apply causal masking efficiently. Combining…
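The core idea in the abstract, replacing softmax with a polynomial kernel so that attention can be computed in time linear in sequence length, can be illustrated with a minimal NumPy sketch. It rests on several assumptions that are not taken from the paper: degree 2 is used for brevity (the abstract speaks of high degree), the weights are normalized by a plain row sum, no causal mask is applied, and the feature map is the exact tensor product `vec(x x^T)` rather than a sketch. The paper's sketching step would replace this d^2-dimensional map with a much smaller approximate one; the function names below are hypothetical.

```python
import numpy as np


def poly_attention_quadratic(Q, K, V, p=2):
    """Reference O(n^2) attention with weights proportional to (q . k)^p."""
    scores = (Q @ K.T) ** p                      # (n, n); non-negative for even p
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ V


def degree2_features(X):
    """Exact degree-2 feature map phi(x) = vec(x x^T).

    It satisfies phi(q) . phi(k) == (q . k)^2, at the cost of d^2 features.
    """
    n, d = X.shape
    return np.einsum("ni,nj->nij", X, X).reshape(n, d * d)


def poly_attention_linear(Q, K, V):
    """Same output as the degree-2 reference, without forming the n x n matrix."""
    phi_q = degree2_features(Q)                  # (n, d^2)
    phi_k = degree2_features(K)                  # (n, d^2)
    kv = phi_k.T @ V                             # (d^2, d_v), one pass over keys
    k_sum = phi_k.sum(axis=0)                    # (d^2,), used for normalization
    numer = phi_q @ kv                           # (n, d_v)
    denom = phi_q @ k_sum                        # (n,)
    return numer / denom[:, None]


rng = np.random.default_rng(0)
n, d = 64, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

out_quadratic = poly_attention_quadratic(Q, K, V, p=2)
out_linear = poly_attention_linear(Q, K, V)
max_abs_diff = float(np.max(np.abs(out_quadratic - out_linear)))
print("max abs difference:", max_abs_diff)
```

Both functions compute the same quantity; the second reorders the matrix products so the cost scales with n * d^2 * d_v instead of n^2. The exact feature dimension grows as d^p with the degree, which is why an approximate, lower-dimensional map becomes necessary for higher degrees.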

Taxonomy

Topics: Neural Networks and Applications · Generative Adversarial Networks and Image Synthesis

Methods: Attention Is All You Need · Attention Dropout · Residual Connection · Dropout · Layer Normalization · Dense Connections · Position-Wise Feed-Forward Layer · Label Smoothing · Cosine Annealing