FLARE: Fast Low-rank Attention Routing Engine
Vedant Puri, Aditya Joglekar, Sri Datta Ganesh Bandreddi, Kevin Ferguson, Yu-hsuan Chen, Yongjie Jessica Zhang, Levent Burak Kara

TL;DR
FLARE introduces a low-rank attention mechanism that routes information through a small set of latent tokens, making transformers more scalable and efficient on long sequences. It achieves state-of-the-art accuracy on PDE surrogate benchmarks and outperforms existing efficient-attention methods on Long Range Arena.
Contribution
The paper presents FLARE, a novel low-rank attention routing engine that reduces the quadratic cost of self-attention and improves the scalability of transformers, using a minimal encode-decode approach built on standard SDPA.
Findings
Scales to one-million-point unstructured meshes on a single GPU
Achieves state-of-the-art accuracy on PDE surrogate benchmarks
Outperforms existing efficient-attention methods on Long Range Arena
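To give a sense of why routing through latent tokens matters at the one-million-point scale, the back-of-the-envelope sketch below counts attention-score entries per head for full self-attention versus a two-pass latent scheme. The function names and the latent count of 256 are hypothetical values chosen for illustration, not figures from the paper.

```python
def full_attention_scores(n_tokens: int) -> int:
    """Score entries for one head of full self-attention: N x N."""
    return n_tokens * n_tokens


def latent_routed_scores(n_tokens: int, n_latents: int) -> int:
    """Score entries for one head of a two-pass latent scheme.

    One M x N pass (latents read the inputs) plus one N x M pass
    (inputs read the latents).
    """
    return 2 * n_tokens * n_latents


n_points = 1_000_000   # mesh size quoted in the findings
n_latents = 256        # ASSUMPTION: illustrative latent count

full = full_attention_scores(n_points)
routed = latent_routed_scores(n_points, n_latents)
print(f"full self-attention: {full:.3e} score entries")
print(f"latent routing:      {routed:.3e} score entries")
print(f"reduction factor:    {full / routed:.1f}x")
```

The count grows quadratically in the number of points for full attention but only linearly when the latent count is held fixed, which is the scaling behaviour the findings rely on.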
Abstract
The quadratic complexity of self-attention limits the scalability of transformers on long sequences. We introduce Fast Low-rank Attention Routing Engine (FLARE), a token-mixing operator that realizes low-rank attention by routing information through a small set of latent tokens. Each layer induces an input-input token mixing matrix of rank at most M, the number of latent tokens, via a minimal encode-decode factorization implemented using only two standard scaled dot-product attention (SDPA) calls. Because the dominant computation is expressed purely in terms of standard SDPA, FLARE is compatible with fused attention kernels and avoids materializing projection matrices. FLARE further assigns disjoint latent slices to each attention head, yielding a mixture of head-specific low-rank pathways. Empirically, FLARE scales to one-million-point unstructured meshes on a single GPU, achieves…
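As a rough illustration of the encode-decode idea in the abstract, the following NumPy sketch routes N input tokens through M latent tokens using two softmax-attention calls and checks that the induced N x N mixing matrix has rank at most M. The function names (`attention_weights`, `sdpa`, `flare_like_mix`), the single-head setup, and the choice to reuse the same latent queries and input keys in the decode step are assumptions made for illustration; the abstract does not specify these details, and this is not the authors' implementation.

```python
import numpy as np


def attention_weights(q, k):
    """Row-stochastic softmax(q k^T / sqrt(d)), shape (len(q), len(k))."""
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)


def sdpa(q, k, v):
    """Plain scaled dot-product attention (no masking, no dropout)."""
    return attention_weights(q, k) @ v


def flare_like_mix(latent_q, k, v):
    """Two SDPA calls: encode inputs into latents, then decode back.

    ASSUMPTION: the decode step swaps the roles of the latent queries
    and the input keys; the source abstract does not confirm this.
    """
    z = sdpa(latent_q, k, v)       # encode: (M, D) latent summary
    return sdpa(k, latent_q, z)    # decode: (N, D) mixed tokens


rng = np.random.default_rng(0)
N, M, D = 64, 4, 8                 # tokens, latents, head dim (toy sizes)
latent_q = rng.standard_normal((M, D))
k = rng.standard_normal((N, D))
v = rng.standard_normal((N, D))

y = flare_like_mix(latent_q, k, v)

# Materialize the induced N x N mixing matrix only to inspect its rank.
mix = attention_weights(k, latent_q) @ attention_weights(latent_q, k)
print("output shape:", y.shape)
print("rank of mixing matrix:", np.linalg.matrix_rank(mix), "(bound M =", M, ")")
```

In a real model the two calls would go to a fused SDPA kernel, so neither the N x M weights nor their N x N product would be stored; the explicit `mix` matrix here exists only to check the rank bound.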
Peer Reviews
Decision·Submitted to ICLR 2026
1. The idea of low-rank self-attention is interesting. This work provides a good explanation of architecture design for the neural operator domain, especially classical works and methods based on Transformer architectures. 2. A new dataset, LPBF, is proposed, making a valuable contribution to the advancement of research in this field.
1. The experimental datasets do not include one-dimensional or time-dependent PDE problems. It is recommended to add experiments demonstrating the model's generality, for example by including shallow-water and reaction-diffusion equations from PDEBench as time-dependent cases (Takamoto, Makoto, et al. "PDEBench: An extensive benchmark for scientific machine learning." Advances in Neural Information Processing Systems 35 (2022): 1596-1611). 2. Since Mamba also achieves linear computational complexi
The proposed method achieves state-of-the-art PDE surrogate performance while also handling geometries with a million points.
I think it is essential that the experiments also compare against other efficient attention methods proposed in general domains.
1) The authors demonstrate a transformer-based surrogate with global communication, trained directly on ~1M-point meshes on a single H100 80GB GPU using off-the-shelf fused attention kernels, and provide scaling curves. Scalability in experiments has not received enough attention in the literature in this field, which makes these experiments the strong point. 2) They retrain all baselines (Perceiver IO, Transolver, LNO, etc.) under standardized splits, resolutions, hyperparameters,
1) FLARE's low-rank attention via a latent bottleneck is directly linked to the attention mechanisms of Perceiver/Perceiver IO (iterative cross-attention into a fixed-size latent that scales linearly with input size) and Linformer (self-attention is low-rank; approximate it to get O(N) complexity). There are various methods along the same direction, which makes the novelty of the concept unclear. If the novelty lies in reducing O(N^2) to O(N·M), then the concept is not new. If the mechanism by which this procedure h
Taxonomy
Topics: Brain Tumor Detection and Classification · Advanced Computing and Algorithms · Machine Learning and ELM
