PolySketchFormer: Fast Transformers via Sketching Polynomial Kernels
Praneeth Kacham, Vahab Mirrokni, Peilin Zhong

TL;DR
PolySketchFormer introduces a linear-time Transformer model using polynomial sketching techniques, enabling efficient long-context language modeling with provable guarantees and significant speedups over existing methods.
Contribution
The paper presents a novel polynomial sketching approach for attention, achieving linear-time Transformers without sparsification, and demonstrates its effectiveness on large-scale language modeling tasks.
Findings
Achieves 2.5-4x training speedup on long-context models
Maintains model quality comparable to standard Transformers
Validates approach on synthetic and real-world datasets
Abstract
The quadratic time and memory complexity inherent to self-attention mechanisms, with respect to sequence length, presents a critical computational bottleneck in the training and deployment of large-scale Transformer-based language models. Recent theoretical results indicate the intractability of sub-quadratic softmax attention approximation under reasonable complexity assumptions. This paper addresses this challenge by first demonstrating that polynomial attention with high degree can effectively replace softmax without sacrificing model quality. Next, we develop polynomial sketching techniques from numerical linear algebra to achieve linear-time polynomial attention with approximation guarantees. Crucially, our approach achieves this speedup without requiring the sparsification of attention matrices. We also present a block-based algorithm to apply causal masking efficiently. Combining…
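The linear-time idea in the abstract can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation. It uses the exact degree-2 feature map phi(x) = vec(x x^T), for which (q·k)^2 = phi(q)·phi(k), where the paper uses a lower-dimensional sketch and a higher degree. It also omits causal masking, and it normalizes attention rows by their plain sum. All function names are hypothetical.

```python
import numpy as np


def poly_attention_quadratic(Q, K, V, p=2):
    """Reference polynomial attention: builds the full n x n score matrix."""
    scores = (Q @ K.T) ** p  # even p keeps every score non-negative
    return (scores @ V) / scores.sum(axis=1, keepdims=True)


def poly_features_deg2(X):
    """Exact degree-2 feature map: each row x becomes vec(x x^T), of size d*d."""
    n, d = X.shape
    return np.einsum("ni,nj->nij", X, X).reshape(n, d * d)


def poly_attention_linear(Q, K, V):
    """Same output as the p=2 reference, without forming an n x n matrix.

    Assumption: exact features stand in for the paper's sketched features.
    Cost is linear in the sequence length n (but grows as d^2 in feature size).
    """
    phi_q = poly_features_deg2(Q)   # n x d^2
    phi_k = poly_features_deg2(K)   # n x d^2
    kv = phi_k.T @ V                # d^2 x d_v, summed over all keys once
    k_sum = phi_k.sum(axis=0)       # d^2, used for the row normalizer
    numerator = phi_q @ kv          # n x d_v
    denominator = phi_q @ k_sum     # n
    return numerator / denominator[:, None]


rng = np.random.default_rng(0)
n, d = 64, 8
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

out_quadratic = poly_attention_quadratic(Q, K, V, p=2)
out_linear = poly_attention_linear(Q, K, V)
print("outputs match:", np.allclose(out_quadratic, out_linear))
```

The reordering works because the score matrix factors as phi(Q) phi(K)^T, so the product with V can be associated as phi(Q) (phi(K)^T V). The exact feature dimension grows as d^p, which is why a sketch of the feature map is needed to keep the cost manageable at high degree.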
Taxonomy
Topics: Neural Networks and Applications · Generative Adversarial Networks and Image Synthesis
Methods: Attention Is All You Need · Attention Dropout · Residual Connection · Dropout · Layer Normalization · Dense Connections · Position-Wise Feed-Forward Layer · Label Smoothing · Cosine Annealing
