Spectraformer: A Unified Random Feature Framework for Transformer
Duke Nguyen, Du Yin, Aditya Joshi, Flora Salim

TL;DR
Spectraformer introduces a unified random feature framework for Transformer attention, achieving state-of-the-art performance among random feature methods and offering various trade-offs in efficiency and accuracy.
Contribution
We present Spectraformer, a systematic framework that unifies and advances random feature-based attention approximation in Transformers, setting new performance benchmarks.
Findings
Achieves performance comparable to top sparse and low-rank methods on Long Range Arena.
Establishes a new state-of-the-art for random feature-based Transformers.
Offers multiple variants with different accuracy, training time, and memory trade-offs.
Abstract
Linearization of attention using various kernel approximation and kernel learning techniques has shown promise. Past methods used a subset of combinations of component functions and weight matrices within the random feature paradigm. We identify the need for a systematic comparison of different combinations of weight matrices and component functions for attention learning in Transformer. Hence, we introduce Spectraformer, a unified framework for approximating and learning the kernel function in the attention mechanism of the Transformer. Our empirical results demonstrate, for the first time, that a random feature-based approach can achieve performance comparable to top-performing sparse and low-rank methods on the challenging Long Range Arena benchmark. Thus, we establish a new state-of-the-art for random feature-based efficient Transformers. The framework also produces many variants…
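To make the random feature paradigm in the abstract concrete, here is a minimal NumPy sketch that pairs a component function with a weight matrix and uses the resulting feature map to linearize attention. All function names are hypothetical and are not taken from the Spectraformer code. The choices shown (positive and trigonometric random features, i.i.d. Gaussian and block-orthogonal weights) are standard examples from the random feature literature, assumed here for illustration, and are not necessarily the combinations the paper evaluates.

```python
import numpy as np

def gaussian_weights(s, d, rng):
    """Weight matrix W of shape (s, d) with i.i.d. standard normal entries."""
    return rng.standard_normal((s, d))

def orthogonal_weights(s, d, rng):
    """Block-orthogonal W: rows are orthogonal within each d x d block,
    with row norms resampled to match those of Gaussian vectors."""
    blocks = []
    for _ in range(-(-s // d)):  # ceil(s / d) blocks
        q, _r = np.linalg.qr(rng.standard_normal((d, d)))
        norms = np.linalg.norm(rng.standard_normal((d, d)), axis=1)
        blocks.append(q * norms[:, None])
    return np.vstack(blocks)[:s]

def positive_features(x, W):
    """Positive random features: phi(x) . phi(y) estimates exp(x . y)."""
    sq = 0.5 * np.sum(x**2, axis=-1, keepdims=True)
    return np.exp(x @ W.T - sq) / np.sqrt(W.shape[0])

def trig_features(x, W):
    """Random Fourier features: phi(x) . phi(y) estimates exp(-||x - y||^2 / 2).
    Note: these can give negative kernel estimates, so the normalizer in
    linear_attention is not guaranteed to be positive with this choice."""
    proj = x @ W.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1) / np.sqrt(W.shape[0])

def linear_attention(Q, K, V, phi, W):
    """Attention computed as phi(Q) (phi(K)^T V), never forming the n x n matrix."""
    scale = Q.shape[-1] ** 0.25          # splits the usual 1/sqrt(d) between Q and K
    q, k = phi(Q / scale, W), phi(K / scale, W)
    kv = k.T @ V                         # (features, d_v)
    z = q @ k.sum(axis=0)                # row normalizers, shape (n,)
    return (q @ kv) / z[:, None]

def softmax_attention(Q, K, V):
    """Exact softmax attention, used only as a reference."""
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
n, d, s = 16, 8, 8192                    # sequence length, head dim, number of features
Q, K = 0.3 * rng.standard_normal((2, n, d))
V = rng.standard_normal((n, d))

W = gaussian_weights(s, d, rng)
approx = linear_attention(Q, K, V, positive_features, W)
exact = softmax_attention(Q, K, V)
print("max abs error:", np.max(np.abs(approx - exact)))
```

Swapping `gaussian_weights` for `orthogonal_weights`, or one component function for another, changes only the arguments passed to `linear_attention`; that interchangeability is the kind of combination the abstract describes comparing systematically.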
Peer Reviews
Decision · ICLR 2025 Conference Withdrawn Submission
-- The paper’s proposed unification exposes gaps which can be filled with novel combinations of linear kernel methods
-- The paper is well written and easy to follow
-- On the LRA benchmark the novel combinations show some promise
-- Novelty: The formulation is somewhat of a repeat of the formulation of Chowdhury et al. (2022), even though it is more complete, with more component functions and weights.
-- Empirical evaluation is insufficient. The LRA dataset on its own is not sufficient to evaluate which combination works best. The LRA benchmark is old and doesn’t satisfy the current requirements. IMHO, for this paper to pass the acceptance threshold, I would want a much more thorough evaluation, on multiple benchmark datasets and
The research on linear low-rank attention methods for Transformers is important for several practical reasons (fast inference, e.g. for on-device deployment, etc.) and this paper aims to improve the existing methods in the field. The presented extension is sound, and representing projections as learnable vectors rather than vectors sampled from a fixed probability distribution is a neat idea. The experimental section presents a comprehensive comparison with several related methods.
The conclusions are stated pretty vaguely; we read: "Our empirical findings indicate that different kernels are good at different tasks and that kernel choice is fundamental to performant models". This is not a particularly informative statement. Learning the projections rather than taking them from a fixed distribution might introduce additional computational costs. This should be discussed in depth in the paper. Finally, LRA is a pretty old benchmark for testing long-range-attention Transform
* This work presents a framework generalizing random-feature-based attention methods.
* This work recombines the component functions $\phi(\cdot)$ and a learnable weight matrix $W$ presented in existing random-feature attention methods, and does not present a new idea for improving attention. Thus, the novelty of this work itself seems marginal.
* The benefits of exploring other combinations are not convincingly demonstrated. While it is possible that certain unexplored combinations of component functions and learnable weight matrices could improve accuracy, training time, or mem
- Good overview and discussion about various random feature mechanisms for linearizing attention in Transformers.
- Interesting observation that there is no clear random feature method excelling at all the tasks.
- Please rewrite Sec 3.4 in terms of pseudocode. Please do not point to specific lines of code.
- Writing needs to be improved, for example the concepts are mentioned before they are defined. Def 3.2 defines a valid component function even though this concept was mentioned in lines 262-263.
- QMC is explored in the context of shift-invariant kernels in [1] and also in general random features [2]. It feels incremental without any theoretical results as the authors are merely combining different m
Taxonomy
Topics: Neural Networks and Applications · Fault Detection and Control Systems
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
