Data-Aware Random Feature Kernel for Transformers
Amirhossein Farzam, Hossein Mobahi, Nolan Andrew Miller, Luke Sernau

TL;DR
DARKFormer introduces a data-aligned kernel attention mechanism for transformers that reduces variance and improves performance in resource-limited settings by combining random features with input-aware geometry.
Contribution
The paper proposes a novel data-aligned kernel for random-feature attention, enabling efficient importance sampling and improved training stability in transformers.
Findings
DARKFormer narrows the performance gap with exact softmax attention.
It improves training stability and efficiency in resource-constrained environments.
Empirical results show improved finetuning performance when the pretrained representations are anisotropic.
Abstract
Transformers excel across domains, yet their quadratic attention complexity poses a barrier to scaling. Random-feature attention, as in Performers, can reduce this cost to linear in the sequence length by approximating the softmax kernel with positive random features drawn from an isotropic distribution. In pretrained models, however, queries and keys are typically anisotropic. This induces high Monte Carlo variance in isotropic sampling schemes unless one retrains the model or uses a large feature budget. Importance sampling can address this by adapting the sampling distribution to the input geometry, but complex data-dependent proposal distributions are often intractable. We show that by data-aligning the softmax kernel, we obtain an attention mechanism that both admits a tractable minimal-variance proposal distribution for importance sampling and exhibits better training…
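The isotropic baseline that the abstract describes (Performer-style positive random features) can be sketched in a few lines of NumPy. This is only an illustration of the standard identity exp(q·k) = E_w[exp(w·q - |q|²/2) · exp(w·k - |k|²/2)] with w drawn from N(0, I). It is not the DARKFormer kernel or its proposal distribution, which this summary does not specify. The function names, shapes, feature count, and input scale are assumptions chosen for the example.

```python
import numpy as np

def positive_random_features(x, w):
    """Map rows of x (n, d) to strictly positive features (n, m).

    w has shape (m, d) with rows drawn from an isotropic N(0, I).
    """
    m = w.shape[0]
    proj = x @ w.T                                          # (n, m)
    half_sq_norm = 0.5 * np.sum(x * x, axis=-1, keepdims=True)
    return np.exp(proj - half_sq_norm) / np.sqrt(m)

def linear_attention(q, k, v, w):
    """Random-feature approximation of softmax attention, linear in n."""
    d = q.shape[-1]
    scale = d ** -0.25            # splits the 1/sqrt(d) temperature across q and k
    q_feat = positive_random_features(q * scale, w)         # (n, m)
    k_feat = positive_random_features(k * scale, w)         # (n, m)
    kv = k_feat.T @ v                                       # (m, d_v), no n x n matrix
    normalizer = q_feat @ k_feat.sum(axis=0)                # (n,)
    return (q_feat @ kv) / normalizer[:, None]

def softmax_attention(q, k, v):
    """Exact softmax attention, quadratic in n, used as the reference."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Hypothetical toy sizes: n tokens, head dimension d, m random features.
rng = np.random.default_rng(0)
n, d, m = 64, 8, 4096
q = 0.5 * rng.standard_normal((n, d))
k = 0.5 * rng.standard_normal((n, d))
v = rng.standard_normal((n, d))
w = rng.standard_normal((m, d))   # isotropic proposal, as in the baseline

approx = linear_attention(q, k, v, w)
exact = softmax_attention(q, k, v)
max_abs_error = float(np.max(np.abs(approx - exact)))
print("max abs error:", max_abs_error)
```

The toy queries and keys here are themselves roughly isotropic and small in norm, which is the favorable case for this estimator. The abstract's point is that pretrained queries and keys usually are not, so the same feature budget yields a noisier estimate unless the sampling distribution is adapted to the data.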
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis
