FutureFill: Fast Generation from Convolutional Sequence Models
Naman Agarwal, Xinyi Chen, Evan Dogariu, Devan Shah, Hubert Strauss, Vlad Feinberg, Daniel Suo, Peter Bartlett, Elad Hazan

TL;DR
FutureFill is a novel method that significantly accelerates auto-regressive sequence generation using convolutional models, reducing computation time and cache size requirements, thus enabling faster and more efficient sequence prediction.
Contribution
The paper introduces FutureFill, a general-purpose technique that reduces generation time from quadratic to quasilinear in the context length and, when generating from a prompt, needs a prefill cache whose size grows only with the number of tokens to be generated.
Findings
Achieves substantial speedups in sequence generation tasks.
Reduces cache size growth during prompt-based generation.
Validates theoretical efficiency gains with experiments.
Abstract
We address the challenge of efficient auto-regressive generation in sequence prediction models by introducing FutureFill, a general-purpose fast generation method for any sequence prediction algorithm based on convolutional operators. FutureFill reduces generation time from quadratic to quasilinear in the context length. Moreover, when generating from a prompt, it requires a prefill cache whose size grows only with the number of tokens to be generated, often much smaller than the caches required by standard convolutional or attention based models. We validate our theoretical claims with experiments on synthetic tasks and demonstrate substantial efficiency gains when generating from a deep convolutional sequence prediction model.
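The prefill claim in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`future_fill`, `generate_naive`, `generate_futurefill`), the scalar feedback rule `step`, and the single-filter, single-channel setup are all assumptions made for illustration. The sketch precomputes, with one FFT, the contribution of the prompt to every future output position, so that generation only needs a cache of length `num_new` instead of the whole prompt. It covers only the prompt part; it does not attempt the quasilinear treatment of the generated tokens themselves.

```python
import numpy as np

def future_fill(u_past, phi, num_future):
    """Contribution of the past inputs u_past (length t >= 1) to the outputs
    at positions t .. t+num_future-1, computed with one FFT convolution.
    Assumes len(phi) >= t + num_future."""
    t = len(u_past)
    n = t + num_future
    size = 1 << (t + n - 1).bit_length()  # power of two >= linear conv length
    full = np.fft.irfft(np.fft.rfft(u_past, size) * np.fft.rfft(phi[:n], size), size)
    return full[t:n]  # cache of length num_future

def generate_naive(prompt, phi, num_new, step):
    """Baseline: recompute the full causal convolution at every new position."""
    u = list(prompt)
    y_prev, ys = 0.0, []
    for _ in range(num_new):
        u.append(step(y_prev))                  # hypothetical feedback rule
        s = len(u) - 1
        y_prev = float(np.dot(u, phi[s::-1]))   # sum_i u[i] * phi[s - i]
        ys.append(y_prev)
    return np.array(ys)

def generate_futurefill(prompt, phi, num_new, step):
    """Same outputs, but the prompt is touched once; afterwards only the
    cache and the newly generated inputs are read."""
    cache = future_fill(np.asarray(prompt, dtype=float), phi, num_new)
    new_u = np.zeros(num_new)
    y_prev, ys = 0.0, []
    for k in range(num_new):
        new_u[k] = step(y_prev)
        # prompt contribution (cached) + contribution of generated inputs
        y_prev = float(cache[k] + np.dot(new_u[:k + 1], phi[k::-1]))
        ys.append(y_prev)
    return np.array(ys)
```

With a 40-token prompt and 24 generated tokens, for example, the cache holds 24 numbers, and the two generation routines should agree up to floating-point error.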
Peer Reviews
Decision·ICLR 2026 Poster
- Addresses a less-studied yet important bottleneck in convolutional language models.
- Strong theoretical foundation with clear runtime and correctness guarantees.
- Training-free and exact, with no compromise on model quality.
- Shows consistent practical gains and integrates seamlessly with existing architectures.
- Provides clear implementation details enabling reproducibility.

- Scalability to multi-billion parameter models not yet validated.
- Baselines limited; comparison with Hyena, RWKV, and S4 models would be useful.
- Reports only latency metrics; including FLOPs or energy-based analysis would make results more hardware-agnostic.
- Hardware dependency unclear: speedups may vary across GPUs/TPUs.
- Memory–latency trade-offs and cache behavior could be analyzed more deeply.
* Overall, I think this is quite a strong paper. The use of convolutional approaches, as in e.g. Hyena, is an important direction for addressing the quadratic complexity issue. This paper's contribution looks like an important step in that direction and could well be adopted quite widely as a source of performance improvements.
These are mostly relatively minor.
* The Abstract is fairly short and bare-bones; it’s not really until reading the paper that the actual importance of the work comes through.
* It’s fairly reasonable given space constraints to save most of the literature review of Sec 1.1 for the appendix. However, something that I thought was missing in both Sec 1.1 and the appendix’s extended version was the discussion of differences wrt Oncescu et al. (2024). This is presented as an independent work that
1. The paper tackles a critical and well-known bottleneck in sequence modeling, i.e. the slow quadratic-time generation process for models based on convolutions. Making this efficient, especially in long-sequence scenarios, is a major practical contribution.
2. The paper is clearly written and the theoretical claims are well-supported by theorems and complexity analysis. The experiments demonstrate not only asymptotic behavior but also wall-clock time improvements.
3. The idea of FutureFill is intuitive
1. The paper points out that there has been independent and concurrent work that achieves the same runtime complexity, which slightly tempers the novelty.
2. The paper seems to focus only on the FlashSTU-T model; it would be nice to show how this generalizes to other convolutional models (e.g. Hyena).
3. While the paper does demonstrate a real speed improvement of 2x, it was not as dramatic as the difference from $O(L^2)$ to $O(L)$. More detailed analysis here on why the speedup was not signifi
Taxonomy
Topics: Machine Learning and Data Classification
