Data-Aware Random Feature Kernel for Transformers

Amirhossein Farzam; Hossein Mobahi; Nolan Andrew Miller; Luke Sernau

arXiv:2603.04127 · cs.LG · March 5, 2026

Open Access

TL;DR

DARKFormer introduces a data-aligned kernel attention mechanism for transformers that reduces variance and improves performance in resource-limited settings by combining random features with input-aware geometry.

Contribution

The paper proposes a novel data-aligned kernel for random-feature attention, enabling efficient importance sampling and improved training stability in transformers.

Findings

01

DARKFormer narrows the performance gap with exact softmax attention.

02

It improves training stability and efficiency in resource-constrained environments.

03

Empirical results show enhanced finetuning performance with anisotropic pretrained representations.

Abstract

Transformers excel across domains, yet their quadratic attention complexity poses a barrier to scaling. Random-feature attention, as in Performers, can reduce this cost to linear in the sequence length by approximating the softmax kernel with positive random features drawn from an isotropic distribution. In pretrained models, however, queries and keys are typically anisotropic. This induces high Monte Carlo variance in isotropic sampling schemes unless one retrains the model or uses a large feature budget. Importance sampling can address this by adapting the sampling distribution to the input geometry, but complex data-dependent proposal distributions are often intractable. We show that data-aligning the softmax kernel yields an attention mechanism that both admits a tractable minimal-variance proposal distribution for importance sampling and exhibits better training…
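For readers unfamiliar with the baseline the abstract refers to, the following is a minimal NumPy sketch of Performer-style positive random features with isotropic Gaussian sampling. It is not the DARKFormer kernel or its proposal distribution, which this page does not specify. The function names, the toy sizes, and the 0.3 scaling of queries and keys are illustrative assumptions, and the usual 1/sqrt(d) softmax temperature is assumed to be folded into q and k.

```python
import numpy as np

def positive_random_features(x, w):
    """Map rows of x (n, d) to positive features (n, m) using projections w (m, d).

    Relies on the identity exp(q.k) = E_w[exp(w.q - |q|^2/2) * exp(w.k - |k|^2/2)]
    for w ~ N(0, I), i.e. isotropic sampling.
    """
    m = w.shape[0]
    half_sq_norm = 0.5 * np.sum(x * x, axis=-1, keepdims=True)
    return np.exp(x @ w.T - half_sq_norm) / np.sqrt(m)

def random_feature_attention(q, k, v, w):
    """Approximate softmax attention without forming the (n, n) score matrix."""
    q_feat = positive_random_features(q, w)    # (n, m)
    k_feat = positive_random_features(k, w)    # (n, m)
    kv = k_feat.T @ v                          # (m, d_v), linear in n
    normalizer = q_feat @ k_feat.sum(axis=0)   # (n,), approximate softmax denominator
    return (q_feat @ kv) / normalizer[:, None]

def softmax_attention(q, k, v):
    """Exact (quadratic) softmax attention, used only as a reference."""
    scores = q @ k.T
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy, isotropic inputs (assumed sizes); small norms keep Monte Carlo variance low.
rng = np.random.default_rng(0)
n, d, m = 16, 8, 4096
q = 0.3 * rng.standard_normal((n, d))
k = 0.3 * rng.standard_normal((n, d))
v = rng.standard_normal((n, d))
w = rng.standard_normal((m, d))  # isotropic draws, w ~ N(0, I)

approx = random_feature_attention(q, k, v, w)
exact = softmax_attention(q, k, v)
max_abs_error = np.max(np.abs(approx - exact))
```

The relative variance of each kernel estimate in this scheme grows like exp(|q + k|^2) - 1, so queries and keys with large norm along a few directions, as in the anisotropic case the abstract describes, need far more features m to reach the same error. That is the gap a data-adapted sampling distribution is meant to close.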

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis