Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences

Yifan Chen; Qi Zeng; Dilek Hakkani-Tur; Di Jin; Heng Ji; Yun Yang

arXiv:2112.05359·cs.LG·December 13, 2021

Open Access · 1 Repo

TL;DR

This paper introduces Skeinformer, a method that uses matrix sketching to accelerate self-attention on long sequences while improving how accurately the attention matrix is approximated. The abstract reports that it outperforms alternatives on the Long Range Arena (LRA) benchmark.

Contribution

It establishes a matrix-sketching framework that connects Linformer and Informer, and proposes Skeinformer, which combines three components: column sampling, adaptive row normalization, and pilot sampling reutilization.

Findings

01. Outperforms existing methods on the LRA benchmark

02. Reduces the time and space complexity of self-attention

03. Improves the accuracy of the matrix approximation to self-attention

Abstract

Transformer-based models are not efficient in processing long sequences due to the quadratic space and time complexity of the self-attention modules. To address this limitation, Linformer and Informer are proposed to reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection respectively. These two models are intrinsically connected, and to understand their connection, we introduce a theoretical framework of matrix sketching. Based on the theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve the accuracy of matrix approximation to self-attention with three carefully designed components: column sampling, adaptive row normalization and pilot sampling reutilization. Experiments on the Long Range Arena (LRA) benchmark demonstrate that our methods outperform alternatives with a consistently…
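To make the idea of column sampling concrete, the following is a minimal NumPy sketch, not the authors' implementation. It compares exact softmax attention with an approximation that keeps only m sampled columns (keys) of the score matrix and renormalizes each row over those columns. Uniform sampling and the plain renormalization are simplifying assumptions made here; the paper's sampling probabilities, adaptive row normalization, and pilot sampling reutilization are not reproduced. The function names `exact_attention` and `column_sampled_attention` are hypothetical.

```python
import numpy as np


def exact_attention(Q, K, V):
    """Standard softmax attention: O(n^2) time and memory in sequence length n."""
    d = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d)                 # n x n score matrix
    scores -= scores.max(axis=1, keepdims=True)   # stabilize the exponentials
    A = np.exp(scores)
    return (A / A.sum(axis=1, keepdims=True)) @ V


def column_sampled_attention(Q, K, V, m, rng):
    """Illustrative approximation (assumption: uniform sampling without
    replacement). Only m columns of the score matrix are computed, and each
    row is renormalized over the sampled columns."""
    n, d = K.shape
    idx = rng.choice(n, size=m, replace=False)    # sampled key/value indices
    scores = Q @ K[idx].T / np.sqrt(d)            # n x m instead of n x n
    scores -= scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    return (A / A.sum(axis=1, keepdims=True)) @ V[idx]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 512, 32
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    full = exact_attention(Q, K, V)
    for m in (32, 128, 512):
        approx = column_sampled_attention(Q, K, V, m, rng)
        err = np.linalg.norm(approx - full) / np.linalg.norm(full)
        print(f"m={m:4d}  relative error={err:.3f}")
```

With m much smaller than n, the cost of forming the scores drops from n × n to n × m, at the price of an approximation error. When m equals n, every column is kept and the result matches exact attention up to floating-point rounding.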

Code & Models

Repositories

pkuzengqi/skeinformer
PyTorch · Official

Taxonomy

Topics: Neural Networks and Applications · Computational Physics and Python Applications · Parallel Computing and Optimization Techniques

Methods: Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Dense Connections · Multi-Head Linear Attention · Layer Normalization · Linformer