Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond

Costin-Andrei Oncescu; Sanket Purandare; Stratos Idreos; Sham Kakade

arXiv:2410.12982·cs.LG·November 12, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a method for exact inference in long convolution sequence models that runs in quasilinear O(L log^2 L) time, with up to 7.8× end-to-end speedup over standard approaches.

Contribution

The authors develop a general framework for quasilinear inference in LCSMs, significantly reducing inference time and enabling high parallelization, with empirical validation on Hyena.

Findings

01 · Up to 7.8× end-to-end speedup in inference

02 · 110× speedup within position-mixing component

03 · Achieves O(L log^2 L) inference complexity

Abstract

While transformers have been at the core of most recent advancements in sequence generative models, their computational cost remains quadratic in sequence length. Several subquadratic architectures have been proposed to address this computational issue. Some of them, including long convolution sequence models (LCSMs), such as Hyena, address this issue at training time but remain quadratic during inference. We propose a method for speeding up LCSMs' exact inference to quasilinear $O(L \log^{2} L)$ time, identify the key properties that make this possible, and propose a general framework that exploits these. Our approach, inspired by previous work on relaxed polynomial interpolation, is based on a tiling which helps decrease memory movement and share computation. It has the added benefit of allowing for almost complete parallelization across layers of the position-mixing part of the…
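To make the tiling idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single scalar channel, a filter `rho` of the same length as the sequence, and a hypothetical `step` function standing in for everything in the model that turns one output into the next input. The names `naive_online_conv`, `tiled_online_conv`, and `fft_convolve` are illustrative. The schedule shown is the standard dyadic tiling from relaxed (online) multiplication: after position `p` is finalized, a tile of side `U` (the largest power of two dividing `p + 1`) pushes the contribution of the last `U` inputs onto the next `U` outputs with one FFT convolution, which gives roughly $O(L \log^{2} L)$ work in total instead of the quadratic cost of the lazy baseline.

```python
import cmath
import math


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def fft_convolve(a, b):
    """Full linear convolution of two real sequences via zero-padded FFT."""
    out_len = len(a) + len(b) - 1
    n = 1
    while n < out_len:
        n *= 2
    fa = fft(list(a) + [0.0] * (n - len(a)))
    fb = fft(list(b) + [0.0] * (n - len(b)))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    return [v.real / n for v in prod[:out_len]]


def naive_online_conv(rho, first, step):
    """Lazy baseline: recompute each output from scratch, O(L^2) overall."""
    L = len(rho)
    y, z = [first], []
    for p in range(L):
        z.append(sum(y[s] * rho[p - s] for s in range(p + 1)))
        if p + 1 < L:
            y.append(step(z[p]))  # next input depends on current output
    return z


def tiled_online_conv(rho, first, step):
    """Same outputs as the baseline, computed with dyadic tiles."""
    L = len(rho)
    y = [0.0] * L
    z = [0.0] * L
    y[0] = first
    for p in range(L):
        z[p] += y[p] * rho[0]  # last missing term: z[p] is now final
        if p + 1 == L:
            break
        y[p + 1] = step(z[p])
        i = p + 1
        U = i & -i  # largest power of two dividing i
        hi = min(i + U, L)
        # Contribution of y[i-U : i] to z[i : hi]; filter offsets are 1 .. 2U-1.
        c = fft_convolve(y[i - U:i], rho[1:2 * U])
        for k in range(hi - i):
            z[i + k] += c[U - 1 + k]
    return z


if __name__ == "__main__":
    rho = [0.5 ** (k + 1) * math.cos(k) for k in range(37)]
    step = lambda v: math.tanh(v) + 0.1  # hypothetical stand-in for the model
    a = naive_online_conv(rho, 1.0, step)
    b = tiled_online_conv(rho, 1.0, step)
    print(max(abs(u - v) for u, v in zip(a, b)))
```

Each pair of input position `s` and later output position `t` falls in exactly one tile (the one splitting the smallest dyadic block that contains both), so the tiled version performs the same sums as the baseline, only grouped so that FFTs can be used. The pure-Python FFT here keeps the sketch dependency-free; a real implementation would use an optimized FFT library and batch across channels and layers.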

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

* As far as I know, the interpolation perspective presented in this paper is original and inspiring. The writing is exceptionally clear, and Figure 1 has been extremely helpful in understanding the proposed method.
* The algorithm introduced in the paper largely solves the long-standing problem of quadratic inference complexity for long convolution models like SGConv and Hyena. This has been a significant bottleneck for the practical deployment of these architectures (note that there are also

Weaknesses

This is a good paper and there is not much weakness to point out in its methodology. However, I find the significance of the work depends on a line of work on long convolution architectures that the authors unfortunately have not discussed or compared. Long convolution kernels can be constructed from smaller convolutions with tree-style hierarchical dilations such as those in WaveNet [1]. Recently, people have shown that these architectures, with nonlinearities removed and weight sharing, can be int

Reviewer 02 · Rating 5 · Confidence 4

Strengths

The paper's main strength is that it is technically sound and novel, and addresses the problem it aims to solve.
- The technical writing is clear and is helped by the inclusion of helpful graphics and rigorous algorithm boxes.
- Many considerations and variants of the core algorithm are proposed.
- An actual implementation is provided and all variants and baselines are benchmarked empirically. There is a conception that long convolutions cannot be implemented efficiently in autoregressive infer

Weaknesses

While the paper provides a technical contribution, the paper's main weakness is that of significance and direction with respect to the broader field; it aims to solve a problem that I believe does not need solving. Correspondingly, the paper's writing (in terms of positioning and related works / baselines) could also use improvement.
- The paper's related work is sparse and I think it is important to present the lineage of these models more carefully. The original (depth separable) LCSMs were in

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- The method is novel and offers interesting improvements in the inference speed of LCSMs.
- The paper offers interesting perspectives that could be used for the design of more efficient (causal, input-dependent) LCSMs in the future.

Weaknesses

- The main weakness of the paper is that the presentation, design decisions and final implementation of the method remain quite abstract, even after reading the paper multiple times. Given that the paper presents an inference strategy, it should be feasible to have a stand-alone implementation (at least for one layer) incorporated in the Appendix of the paper. This would give clarity to the final, concrete version of the algorithm.
- Next, I feel that the presentation of the paper could be im

Taxonomy

Topics: Seismic Imaging and Inversion Techniques · Reservoir Engineering and Simulation Methods · Time Series Analysis and Forecasting

Methods: Convolution