Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention
Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh

TL;DR
Nyströmformer introduces a scalable self-attention approximation using the Nyström method, enabling efficient processing of long sequences in transformers while maintaining competitive performance on standard NLP benchmarks.
Contribution
It adapts the Nyström method to approximate self-attention with linear complexity, allowing transformers to handle longer sequences efficiently.
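The idea can be illustrated with a minimal NumPy sketch, assuming single-head attention without batching. The names `nystrom_attention`, `exact_attention`, and `num_landmarks` are illustrative and not taken from this page. Landmarks are computed as segment means, and `np.linalg.pinv` stands in for the iterative pseudoinverse approximation used in the paper, so this is a sketch under those assumptions rather than the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def exact_attention(Q, K, V):
    # Standard softmax attention; builds the full n x n matrix.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V


def nystrom_attention(Q, K, V, num_landmarks):
    # Hypothetical helper: Nystrom-style approximation of softmax attention.
    n, d = Q.shape
    if n % num_landmarks != 0:
        raise ValueError("sequence length must be divisible by num_landmarks")
    seg = n // num_landmarks

    # Landmarks: mean of each contiguous segment of queries / keys (m x d).
    Q_l = Q.reshape(num_landmarks, seg, d).mean(axis=1)
    K_l = K.reshape(num_landmarks, seg, d).mean(axis=1)

    scale = 1.0 / np.sqrt(d)
    F = softmax(Q @ K_l.T * scale)    # n x m
    A = softmax(Q_l @ K_l.T * scale)  # m x m
    B = softmax(Q_l @ K.T * scale)    # m x n

    # Multiply right to left so no n x n matrix is ever formed.
    # Assumption: exact pseudoinverse instead of the paper's iterative scheme.
    return F @ (np.linalg.pinv(A) @ (B @ V))


rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
approx = nystrom_attention(Q, K, V, num_landmarks=4)
print(approx.shape)
```

When `num_landmarks` equals the sequence length, every segment holds one token and the sketch reproduces exact attention up to rounding; fewer landmarks trade accuracy for lower cost.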
Findings
Performs comparably to, or in a few cases slightly better than, standard self-attention on GLUE and IMDB tasks.
Outperforms other efficient self-attention methods on Long Range Arena.
Achieves $O(n)$ complexity in self-attention approximation.
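The linear scaling follows from the shape of the factorization. As a sketch, assume $n$ tokens, head dimension $d$, and $m \ll n$ landmark queries $\tilde{Q}$ and keys $\tilde{K}$ (these symbols are introduced here for illustration and do not appear elsewhere on this page):

```latex
\[
\operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)
\;\approx\;
\underbrace{\operatorname{softmax}\!\left(\frac{Q\tilde{K}^{\top}}{\sqrt{d}}\right)}_{n \times m}
\;
\underbrace{\left[\operatorname{softmax}\!\left(\frac{\tilde{Q}\tilde{K}^{\top}}{\sqrt{d}}\right)\right]^{+}}_{m \times m}
\;
\underbrace{\operatorname{softmax}\!\left(\frac{\tilde{Q}K^{\top}}{\sqrt{d}}\right)}_{m \times n}
\]
```

Here $[\cdot]^{+}$ denotes the Moore-Penrose pseudoinverse. No $n \times n$ matrix is formed, so for a fixed number of landmarks $m$ the time and memory cost grow as $O(n)$.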
Abstract
Transformers have emerged as a powerful tool for a broad range of natural language processing tasks. A key component that drives the impressive performance of Transformers is the self-attention mechanism that encodes the influence or dependence of other tokens on each specific token. While beneficial, the quadratic complexity of self-attention on the input sequence length has limited its application to longer sequences -- a topic being actively studied in the community. To address this limitation, we propose Nyströmformer -- a model that exhibits favorable scalability as a function of sequence length. Our idea is based on adapting the Nyström method to approximate standard self-attention with $O(n)$ complexity. The scalability of Nyströmformer enables application to longer sequences with thousands of tokens. We perform evaluations on multiple downstream tasks on the GLUE…
Videos
Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention (AI Paper Explained) · YouTube
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Ferroelectric and Negative Capacitance Devices
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Nyströmformer · Attention Is All You Need · Dense Connections · Byte Pair Encoding · Softmax · Dropout · Label Smoothing
