ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models

Federico Danieli; Pau Rodriguez; Miguel Sarabia; Xavier Suau; Luca Zappella

arXiv:2510.21450·cs.LG·November 4, 2025

Open Access·3 Reviews

TL;DR

ParaRNN introduces a parallel training framework for nonlinear RNNs that removes the sequential bottleneck, enabling large-scale training with significant speedups and perplexity competitive with Transformers.

Contribution

The paper presents a novel parallelization method for nonlinear RNNs, allowing efficient training at large scale and broadening the range of viable sequence modeling architectures.

Findings

01. Achieved up to 665x speedup over naive sequential methods.

02. Successfully trained 7B parameter nonlinear RNNs with competitive perplexity.

03. Open-sourced the ParaRNN framework for community use.

Abstract

Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures like Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies. To address this, we present ParaRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relationships as a single system of equations, which we solve in parallel using Newton's iterations combined with custom parallel reductions. Our implementation achieves speedups of up…
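The Newton-plus-reduction idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy cell `h_t = tanh(w * h_{t-1} + x_t)`, the zero initial state, and the function names `linear_recurrence_scan`, `newton_parallel_rnn`, and `sequential_rnn` are all assumptions made for illustration. The cell is elementwise, so its Jacobian is diagonal, and each Newton step reduces to a linear recurrence that a prefix scan can solve in a logarithmic number of vectorized passes.

```python
import numpy as np


def linear_recurrence_scan(a, b):
    """Solve h_t = a_t * h_{t-1} + b_t (elementwise, h_{-1} = 0) for all t.

    Hillis-Steele doubling: about log2(T) vectorized passes, each composing
    affine maps with (a2, b2) o (a1, b1) = (a2 * a1, a2 * b1 + b2).
    """
    a, b = a.copy(), b.copy()
    T = a.shape[0]
    shift = 1
    while shift < T:
        # Compose the map ending at step t with the map ending at t - shift.
        # The right-hand sides are evaluated fully before assignment.
        b[shift:] = a[shift:] * b[:-shift] + b[shift:]
        a[shift:] = a[shift:] * a[:-shift]
        shift *= 2
    return b


def newton_parallel_rnn(x, w, num_iters):
    """Newton iterations on the whole sequence of the toy cell (assumed)."""
    T, D = x.shape
    h = np.zeros((T, D))  # initial guess for every hidden state
    for _ in range(num_iters):
        # Current guess for h_{t-1}, with a zero initial state.
        h_prev = np.vstack([np.zeros((1, D)), h[:-1]])
        f = np.tanh(w * h_prev + x)
        jac = w * (1.0 - f**2)  # diagonal Jacobian of the cell w.r.t. h_{t-1}
        # Linearization: h_t = f + jac * (h_{t-1} - h_prev) = jac * h_{t-1} + b
        b = f - jac * h_prev
        h = linear_recurrence_scan(jac, b)
    return h


def sequential_rnn(x, w):
    """Step-by-step reference evaluation of the same toy cell."""
    h = np.zeros_like(x)
    prev = np.zeros(x.shape[1])
    for t in range(x.shape[0]):
        prev = np.tanh(w * prev + x[t])
        h[t] = prev
    return h
```

In this sketch, each Newton iteration makes at least one more time step exact, so `num_iters = T` reproduces the sequential result up to rounding. How many iterations are needed in practice is a separate empirical question that the sketch does not answer.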

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01·Rating 6·Confidence 3

Strengths

1. Tackling the core sequential bottleneck of nonlinear RNNs is a long-standing challenge, so the motivation of the paper is clear. The provided solution, i.e., recasting recurrence evaluation as a global nonlinear system of equations and applying Newton iterations combined with parallel prefix reductions, is conceptually clean and theoretically grounded in previous work.
2. The authors release **FlashRNN**, a PyTorch + CUDA framework that generalizes to arbitrary RNN cells, lowering adoption

Weaknesses

1. As the authors write, imposing diagonal or block-diagonal Jacobians simplifies parallelization but may severely limit expressivity by restricting channel mixing. The solution adopted is similar to that of SSMs, i.e., using downstream MLPs to "restore" expressivity. I believe this needs stronger empirical support on more state-tracking tasks where xLSTM performs relatively well.
2. The authors assume that a small, fixed number of Newton iterations (e.g., 3) suffices. However, this is not guaranteed across oth

Reviewer 02·Rating 6·Confidence 3

Strengths

The proposed idea of casting nonlinear RNN application as a Newton+scan routine with a specialized CUDA solver is well motivated and clearly explained. The GPU hierarchy design is thoughtful (e.g., Appendix D2). The authors promise a code release, which would encourage the community to try the proposed method.

Weaknesses

It is unclear to me how the limitation of diagonal (GRU) and block-diagonal (LSTM) Jacobians is overcome. Does any study show that this limitation does not matter at scale? I suggest adding more open-source baselines to Table 2. Numerical stability when scaling the model up further is also unclear. It would be interesting to see more ablations on the number of Newton iterations, how the memory footprint changes with context length, and how far the context length can be pushed.

Reviewer 03·Rating 8·Confidence 4

Strengths

1. Novelty: Prior work typically applies Newton/scan to given RNNs (e.g., Lim; Gonzalez). This paper instead redesigns LSTM/GRU so their Jacobians are diagonal or 2×2 block-diagonal, which makes each Newton step’s linear system amenable to efficient parallel reduction without runtime Jacobian approximations.
2. Scale of Experiments: This is the most exciting part. The paper successfully trains 7B-parameter FlashGRU/FlashLSTM models and reports their downstream task results. This proves that Fla
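As an editorial illustration of the 2×2 block-diagonal case mentioned in the review above, the following NumPy sketch solves a linear recurrence whose per-step matrices are stacks of independent 2×2 blocks. It is a sketch under stated assumptions, not the paper's CUDA solver: the array layout `(T, N, 2, 2)`, the zero initial state, and the name `block_linear_recurrence_scan` are hypothetical choices for this example. The point is that composing two steps costs only one 2×2 matrix product per block, so the same prefix-scan pattern used for diagonal Jacobians still applies.

```python
import numpy as np


def block_linear_recurrence_scan(A, b):
    """Solve h_t = A_t @ h_{t-1} + b_t with h_{-1} = 0, block by block.

    A has shape (T, N, 2, 2): N independent 2x2 blocks per time step.
    b has shape (T, N, 2). Returns h with shape (T, N, 2).
    Uses Hillis-Steele doubling with the associative composition
    (A2, b2) o (A1, b1) = (A2 @ A1, A2 @ b1 + b2).
    """
    A, b = A.copy(), b.copy()
    T = A.shape[0]
    shift = 1
    while shift < T:
        # Update b with the not-yet-updated A, then update A itself.
        b[shift:] = np.einsum("tnij,tnj->tni", A[shift:], b[:-shift]) + b[shift:]
        A[shift:] = np.einsum("tnij,tnjk->tnik", A[shift:], A[:-shift])
        shift *= 2
    return b


def block_linear_recurrence_sequential(A, b):
    """Step-by-step reference for the same recurrence."""
    T, N = b.shape[0], b.shape[1]
    h = np.zeros((T, N, 2))
    prev = np.zeros((N, 2))
    for t in range(T):
        prev = np.einsum("nij,nj->ni", A[t], prev) + b[t]
        h[t] = prev
    return h
```

A dense N×N Jacobian would need a full matrix product per composition, which is why restricting the structure to small blocks keeps the reduction cheap.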

Weaknesses

N/A

Taxonomy

Topics: Topic Modeling · Machine Learning in Materials Science · Model Reduction and Neural Networks