Random Feature Attention
Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, Lingpeng Kong

TL;DR
This paper introduces RFA, an attention mechanism with linear time and space complexity that uses random features to approximate softmax attention. It lets transformers process long sequences efficiently, with accuracy similar to or better than strong transformer baselines and faster decoding.
Contribution
RFA provides a novel, efficient attention approximation that can replace softmax attention in transformers, improving scalability and speed especially for long sequences.
Findings
RFA achieves accuracy similar to or better than strong transformer baselines on language modeling and machine translation.
RFA decodes twice as fast as vanilla transformers in machine translation.
RFA is effective on long text classification datasets.
Abstract
Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic time and space complexity in the sequence length. We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, and explore its application in transformers. RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism. Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines. In the machine translation experiment, RFA decodes…
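The core idea in the abstract, replacing the softmax kernel with a dot product of random features so that keys and values can be summarized once and reused for every query, can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes unit-norm queries and keys, a softmax temperature of 1, trigonometric random features with standard normal projections, and non-causal attention. The optional gating mechanism is omitted, and all function names are hypothetical.

```python
import numpy as np


def random_feature_map(x, w):
    """Map rows of x (n, d) to trigonometric random features of shape (n, 2D).

    w has shape (D, d) with entries drawn from N(0, 1) (an assumption of this
    sketch). In expectation, phi(x) . phi(y) = exp(-||x - y||^2 / 2).
    """
    proj = x @ w.T  # (n, D)
    num_features = w.shape[0]
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1) / np.sqrt(num_features)


def softmax_attention(q, k, v):
    """Reference softmax attention: quadratic in sequence length."""
    scores = q @ k.T  # (n_q, n_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def rfa_attention(q, k, v, w):
    """Random-feature approximation of softmax attention.

    For unit-norm q and k, exp(q . k) is proportional to
    exp(-||q - k||^2 / 2), and the constant cancels in the normalization.
    Keys and values are summarized once in s and z, so the cost grows
    linearly with the number of keys and queries.
    """
    phi_q = random_feature_map(q, w)  # (n_q, 2D)
    phi_k = random_feature_map(k, w)  # (n_k, 2D)
    s = phi_k.T @ v                   # (2D, d_v): sum of phi(k_i) v_i^T
    z = phi_k.sum(axis=0)             # (2D,):     sum of phi(k_i)
    return (phi_q @ s) / (phi_q @ z)[:, None]


def unit_rows(x):
    """Normalize each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = unit_rows(rng.normal(size=(8, 16)))
    k = unit_rows(rng.normal(size=(64, 16)))
    v = rng.normal(size=(64, 4))
    w = rng.normal(size=(4096, 16))  # D = 4096 random projections
    gap = np.max(np.abs(rfa_attention(q, k, v, w) - softmax_attention(q, k, v)))
    print("max abs difference from softmax attention:", gap)
```

The approximation is stochastic: more random projections (larger D) tighten it at the cost of larger summaries s and z. Trigonometric features can produce small or negative normalizers when D is small, so a practical implementation would need to guard the division.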
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Softmax
