Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond
Costin-Andrei Oncescu, Sanket Purandare, Stratos Idreos, Sham Kakade

TL;DR
This paper introduces an exact inference method for long convolution sequence models (LCSMs) that runs in quasilinear O(L log^2 L) time rather than quadratic time, giving up to 7.8× end-to-end speedup over standard approaches.
Contribution
The authors develop a general framework for quasilinear inference in LCSMs, significantly reducing inference time and enabling high parallelization, with empirical validation on Hyena.
Findings
Up to 7.8× end-to-end speedup in inference
110× speedup within position-mixing component
Achieves O(L log^2 L) inference complexity
Abstract
While transformers have been at the core of most recent advancements in sequence generative models, their computational cost remains quadratic in sequence length. Several subquadratic architectures have been proposed to address this computational issue. Some of them, including long convolution sequence models (LCSMs), such as Hyena, address this issue at training time but remain quadratic during inference. We propose a method for speeding up LCSMs' exact inference to quasilinear time, identify the key properties that make this possible, and propose a general framework that exploits these. Our approach, inspired by previous work on relaxed polynomial interpolation, is based on a tiling which helps decrease memory movement and share computation. It has the added benefit of allowing for almost complete parallelization across layers of the position-mixing part of the…
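The tiling idea described in the abstract can be sketched for a single toy layer. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `fft_conv`, `naive_generate`, and `tiled_generate` are hypothetical, as is the `step` callback, which stands in for everything that maps one mixer output to the next input. The power-of-two tile schedule is the standard one from relaxed (online) multiplication and is assumed here. The filter `rho` is fixed and data-independent, and each new input `y[p + 1]` depends on the finished output `z[p]`, so the convolution has to be computed online.

```python
import numpy as np

def fft_conv(a, b):
    """Full linear convolution of two 1-D arrays via zero-padded FFT."""
    n = len(a) + len(b) - 1
    m = max(n, 2)  # avoid a degenerate length-1 transform
    return np.fft.irfft(np.fft.rfft(a, m) * np.fft.rfft(b, m), m)[:n]

def naive_generate(rho, y0, L, step):
    """Quadratic baseline: z[p] = sum_{j <= p} y[j] * rho[p - j], from scratch."""
    y, z = np.zeros(L), np.zeros(L)
    y[0] = y0
    for p in range(L):
        z[p] = np.dot(y[:p + 1], rho[p::-1])
        if p + 1 < L:
            y[p + 1] = step(z[p])
    return z

def tiled_generate(rho, y0, L, step):
    """Same outputs, but past inputs are pushed forward in power-of-two tiles."""
    y, z = np.zeros(L), np.zeros(L)
    rho_pad = np.concatenate([rho[:L], np.zeros(2 * L)])  # safe slicing
    y[0] = y0
    for p in range(L):
        z[p] += y[p] * rho[0]          # diagonal term; z[p] is now complete
        if p + 1 == L:
            break
        y[p + 1] = step(z[p])          # toy stand-in for the rest of the model
        U = (p + 1) & -(p + 1)         # largest power of two dividing p + 1
        hi = min(p + U, L - 1)         # last target position inside the sequence
        n = hi - p                     # number of targets actually updated
        # Contribution of y[p+1-U .. p] to z[p+1 .. p+U], using rho[1 .. 2U-1].
        full = fft_conv(y[p + 1 - U:p + 1], rho_pad[1:2 * U])
        z[p + 1:hi + 1] += full[U - 1:U - 1 + n]
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L = 100
    rho = 0.1 * rng.standard_normal(L) / (1 + np.arange(L))
    diff = naive_generate(rho, 1.0, L, np.tanh) - tiled_generate(rho, 1.0, L, np.tanh)
    print("max abs difference:", np.abs(diff).max())
```

In this schedule a tile of side U = 2^q is issued once every 2U positions and costs O(U log U) with an FFT, so each of the roughly log L tile sizes contributes O(L log L) work. That is consistent with the O(L log^2 L) total listed under Findings.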
Peer Reviews
Decision·ICLR 2025 Poster
* As far as I know, the interpolation perspective presented in this paper is original and inspiring. The writing is exceptionally clear, and Figure 1 has been extremely helpful in understanding the proposed method.
* The algorithm introduced in the paper largely solves the long-standing problem of quadratic inference complexity for long convolution models like SGConv and Hyena. This has been a significant bottleneck for the practical deployment of these architectures (note that there are also
This is a good paper and there is not much weakness to point out in its methodology. However, I find that the significance of the work depends on a line of work on long convolution architectures that the authors unfortunately have not discussed or compared against. Long convolution kernels can be constructed from smaller convolutions with tree-style hierarchical dilations, such as those in WaveNet [1]. Recently, people have shown that these architectures, with nonlinearities removed and weight sharing, can be int
The paper's main strength is that it is technically sound and novel, and addresses the problem it aims to solve.
- The technical writing is clear and is helped by the inclusion of helpful graphics and rigorous algorithm boxes.
- Many considerations and variants of the core algorithm are proposed.
- An actual implementation is provided and all variants and baselines are benchmarked empirically.
There is a conception that long convolutions cannot be implemented efficiently in autoregressive infer
While the paper provides a technical contribution, its main weakness is one of significance and direction with respect to the broader field; it aims to solve a problem that I believe does not need solving. Correspondingly, the paper's writing (in terms of positioning and related works / baselines) could also use improvement.
- The paper's related work is sparse and I think it is important to present the lineage of these models more carefully. The original (depth separable) LCSMs were in
- The method is novel and offers interesting improvements in the inference speed of LCSMs.
- The paper offers interesting perspectives that could be used for the design of more efficient (causal, input-dependent) LCSMs in the future.
- The main weakness of the paper is that the presentation, design decisions and final implementation of the method remain quite abstract, even after reading the paper multiple times. Given that the paper presents an inference strategy, it should be feasible to have a stand-alone implementation (at least for one layer) incorporated in the Appendix of the paper. This would give clarity to the final, concrete version of the algorithm.
- Next, I feel that the presentation of the paper could be im
Taxonomy
Topics: Seismic Imaging and Inversion Techniques · Reservoir Engineering and Simulation Methods · Time Series Analysis and Forecasting
Methods: Convolution
