Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences
Yifan Chen, Qi Zeng, Dilek Hakkani-Tur, Di Jin, Heng Ji, Yun Yang

TL;DR
This paper introduces Skeinformer, a method that uses matrix sketching to accelerate self-attention on long sequences and to approximate the self-attention output more accurately. The authors report that it outperforms alternative methods on the Long Range Arena (LRA) benchmark.
Contribution
The paper establishes a matrix-sketching framework that connects Linformer (low-dimensional projection) and Informer (row selection), and proposes Skeinformer, which combines three components: column sampling, adaptive row normalization, and pilot sampling reutilization.
Findings
Outperforms existing methods on LRA benchmark
Reduces time and space complexity in self-attention
Improves accuracy of matrix approximation in transformers
Abstract
Transformer-based models are not efficient in processing long sequences due to the quadratic space and time complexity of the self-attention modules. To address this limitation, Linformer and Informer are proposed to reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection respectively. These two models are intrinsically connected, and to understand their connection, we introduce a theoretical framework of matrix sketching. Based on the theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve the accuracy of matrix approximation to self-attention with three carefully designed components: column sampling, adaptive row normalization and pilot sampling reutilization. Experiments on the Long Range Arena (LRA) benchmark demonstrate that our methods outperform alternatives with a consistently…
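To make the idea of column sampling concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes uniform sampling of key/value indices and normalizes each row using only the sampled columns; the sampling probabilities, adaptive row normalization, and pilot sampling reutilization used by Skeinformer are not reproduced here. The function names `softmax_attention` and `column_sampled_attention` and the sizes `n`, `p`, `d` are hypothetical choices made for illustration.

```python
import math
import random


def softmax_attention(Q, K, V, key_idx=None):
    """Softmax attention restricted to the keys in key_idx (all keys if None)."""
    idx = list(range(len(K))) if key_idx is None else list(key_idx)
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        # Scaled dot-product scores against the selected keys only.
        scores = [scale * sum(a * b for a, b in zip(q, K[j])) for j in idx]
        m = max(scores)  # subtract the row maximum for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Normalize the row using only the selected columns.
        out.append([
            sum(w * V[j][c] for w, j in zip(weights, idx)) / total
            for c in range(len(V[0]))
        ])
    return out


def column_sampled_attention(Q, K, V, d, rng):
    """Approximate attention with d key/value indices sampled uniformly
    without replacement (an assumption made for this sketch)."""
    sampled = rng.sample(range(len(K)), d)
    return softmax_attention(Q, K, V, key_idx=sampled)


rng = random.Random(0)
n, p, d = 64, 8, 16  # sequence length, head dimension, number of sampled columns


def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


Q, K, V = random_matrix(n, p), random_matrix(n, p), random_matrix(n, p)

exact = softmax_attention(Q, K, V)                  # n * n score evaluations
approx = column_sampled_attention(Q, K, V, d, rng)  # n * d score evaluations
full = column_sampled_attention(Q, K, V, n, rng)    # sampling every column

max_err = max(abs(a - b) for ra, rb in zip(approx, exact) for a, b in zip(ra, rb))
print("max abs error with d =", d, ":", max_err)
```

With `d` columns the sketch evaluates `n * d` scores instead of `n * n`, which is where the saving over full self-attention comes from. Sampling all `n` columns recovers the exact output up to floating-point rounding, which gives a simple sanity check for the code.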
Taxonomy
Topics: Neural Networks and Applications · Computational Physics and Python Applications · Parallel Computing and Optimization Techniques
Methods: Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Dense Connections · Multi-Head Linear Attention · Layer Normalization · Linformer
