FutureFill: Fast Generation from Convolutional Sequence Models

Naman Agarwal; Xinyi Chen; Evan Dogariu; Devan Shah; Hubert Strauss; Vlad Feinberg; Daniel Suo; Peter Bartlett; Elad Hazan

arXiv:2410.03766·cs.LG·June 24, 2025

Open Access·3 Reviews

TL;DR

FutureFill accelerates auto-regressive generation from convolutional sequence models: it cuts generation time from quadratic to quasilinear in the context length and needs a prefill cache that grows only with the number of tokens to be generated.

Contribution

The paper introduces FutureFill, a general-purpose technique that reduces generation time from quadratic to quasilinear in the context length and, when generating from a prompt, requires a prefill cache whose size grows only with the number of tokens to be generated.

Findings

01. Achieves substantial speedups in sequence generation tasks.

02. Reduces cache size growth during prompt-based generation.

03. Validates theoretical efficiency gains with experiments.

Abstract

We address the challenge of efficient auto-regressive generation in sequence prediction models by introducing FutureFill, a general-purpose fast generation method for any sequence prediction algorithm based on convolutional operators. FutureFill reduces generation time from quadratic to quasilinear in the context length. Moreover, when generating from a prompt, it requires a prefill cache whose size grows only with the number of tokens to be generated, often much smaller than the caches required by standard convolutional or attention based models. We validate our theoretical claims with experiments on synthetic tasks and demonstrate substantial efficiency gains when generating from a deep convolutional sequence prediction model.
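The prefill-cache claim can be illustrated with a small NumPy sketch. It is not the paper's algorithm or code: the function names, the toy scalar model (each output is a causal convolution of the inputs with a filter `phi`, and the next input is `step(output)`), and the filter itself are all assumptions made for illustration. The sketch only shows the prompt-handling idea: one FFT convolution precomputes the prompt's contribution to every future output, so the cache holds `num_new` numbers regardless of prompt length. The generated tokens are still summed naively here, so this sketch alone does not reach the quasilinear total cost stated in the abstract.

```python
import numpy as np

def naive_generate(phi, prompt, num_new, step):
    """Baseline: recompute the full causal convolution sum for every new token."""
    u = list(prompt)
    outputs = []
    for _ in range(num_new):
        t = len(u) - 1  # position of the newest input
        # y_t = sum_{i=0..t} phi[t - i] * u[i]: O(t) work per token
        y = float(np.dot(phi[: t + 1][::-1], u))
        outputs.append(y)
        u.append(step(y))  # toy feedback rule (assumption)
    return np.array(outputs)

def prompt_contribution(phi, prompt, num_new):
    """Prompt's contribution to the next num_new outputs, via one FFT convolution."""
    P = len(prompt)
    T = P + num_new - 1           # filter taps that can touch those outputs
    n = P + T - 1                 # length of the full linear convolution
    spec = np.fft.rfft(prompt, n) * np.fft.rfft(phi[:T], n)
    full = np.fft.irfft(spec, n)
    # Outputs at positions P-1, ..., P+num_new-2: a cache of size num_new
    return full[P - 1 : P - 1 + num_new]

def cached_generate(phi, prompt, num_new, step):
    """Generate using the precomputed prompt contribution plus generated tokens."""
    cache = prompt_contribution(phi, prompt, num_new)
    generated = []  # inputs produced after the prompt
    outputs = []
    for k in range(num_new):
        # Generated input j sits k - 1 - j steps before the current output
        recent = sum(phi[k - 1 - j] * generated[j] for j in range(k))
        y = float(cache[k] + recent)
        outputs.append(y)
        generated.append(step(y))
    return np.array(outputs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    phi = rng.standard_normal(80) / (1.0 + np.arange(80))  # hypothetical filter
    prompt = rng.standard_normal(64)
    a = naive_generate(phi, prompt, 16, np.tanh)
    b = cached_generate(phi, prompt, 16, np.tanh)
    print(np.allclose(a, b))
```

In this sketch the per-token work after the prompt depends only on how many tokens have been generated so far, not on the prompt length, and the filter must have at least `len(prompt) + num_new - 1` taps.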

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

- Addresses a less-studied yet important bottleneck in convolutional language models.
- Strong theoretical foundation with clear runtime and correctness guarantees.
- Training-free and exact, with no compromise on model quality.
- Shows consistent practical gains and integrates seamlessly with existing architectures.
- Provides clear implementation details enabling reproducibility.

Weaknesses

- Scalability to multi-billion parameter models not yet validated.
- Baselines limited; comparison with Hyena, RWKV, and S4 models would be useful.
- Reports only latency metrics; including FLOPs or energy-based analysis would make results more hardware-agnostic.
- Hardware dependency unclear: speedups may vary across GPUs/TPUs.
- Memory–latency trade-offs and cache behavior could be analyzed more deeply.

Reviewer 02·Rating 6·Confidence 2

Strengths

* Overall, I think this is quite a strong paper. The use of convolutional approaches as in e.g. Hyena is an important direction to address the quadratic complexity issue, and this paper’s contribution looks like an important step in that, and could well be adopted quite widely as a source of performance improvements.

Weaknesses

These are mostly relatively minor.

* The Abstract is fairly short and bare-bones; it’s not really until reading the paper that the actual importance of the work comes through.
* It’s fairly reasonable given space constraints to save most of the literature review of Sec 1.1 for the appendix. However, something that I thought was missing in both Sec 1.1 and the appendix’s extended version was the discussion of differences wrt Oncescu et al. (2024). This is presented as an independent work that

Reviewer 03·Rating 8·Confidence 3

Strengths

1. The paper tackles a critical and well-known bottleneck in sequence modeling, i.e. the slow quadratic-time generation process for models based on convolutions. Making this efficient, especially in long-sequence scenarios, is a major practical contribution.
2. The paper is clearly written and the theoretical claims are well-supported by theorems and complexity analysis. The experiments demonstrate not only asymptotic behavior but also wall-clock time improvements.
3. The idea of FutureFill is intuitive

Weaknesses

1. The paper points out that there has been independent and concurrent work that achieves the same runtime complexity, which slightly tempers the novelty.
2. The paper seems to focus only on the FlashSTU-T model; it would be nice to show how this generalizes to other convolutional models (e.g. Hyena).
3. While the paper does demonstrate a real speed improvement of 2x, it was not as dramatic as the difference from $O(L^2)$ to $O(L)$. More detailed analysis here on why the speedup was not signifi

Taxonomy

Topics·Machine Learning and Data Classification