
TL;DR
The paper introduces the Parallel Recursive LSTM (PR-LSTM), a hierarchical recurrent model that replaces left-to-right recurrence with recursive state composition over a balanced computation tree, making recurrent sequence modeling more parallel. On formal-language tasks it solves more tasks than standard RNN, LSTM, and Transformer baselines.
Contribution
PR-LSTM reorganizes recurrent computation hierarchically: it reduces parallel depth while keeping nonlinear state updates, which allows long sequences to be processed efficiently.
Findings
PR-LSTM achieves strong sequence-length generalization on formal-language benchmarks.
It solves more tasks than standard RNN, LSTM, and Transformer baselines.
PR-LSTM avoids the quadratic time and memory cost that self-attention incurs on long sequences.
Abstract
Transformers have become the dominant architecture for sequence modeling by using self-attention to enable expressive and highly parallel processing. However, the resulting quadratic time and memory costs limit efficiency in long-context settings. Recurrent models such as LSTMs provide explicit nonlinear state updates and strong state-tracking capabilities, yet their strictly sequential computation limits parallelism. We introduce the Parallel Recursive LSTM (PR-LSTM), a hierarchical recurrent architecture that replaces left-to-right recurrence with recursive nonlinear state composition over a balanced computation tree. Tokens are first mapped independently to latent states, which are then recursively merged by a learned gated composition block. This structure uses the reduction pattern underlying parallel scans as a fixed execution schedule, rather than assuming an associative…
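The reduction schedule described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a binary balanced tree, and `embed`, `gated_merge`, and `tree_reduce` are hypothetical names, with a fixed toy gate standing in for the learned gated composition block. Adjacent states are merged pairwise, level by level, so the merges within one level are independent of each other and the number of levels grows as ceil(log2 n) rather than n.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

S = TypeVar("S")
State = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def embed(token: str) -> State:
    """Hypothetical per-token map to a latent state (no learned weights)."""
    return [(ord(token) % 7) / 7.0, 1.0]


def gated_merge(left: State, right: State) -> State:
    """Toy stand-in for the learned gated composition block (an assumption).

    The gate depends on both inputs, so the merge is nonlinear and is not
    required to be associative.
    """
    merged = []
    for l, r in zip(left, right):
        gate = sigmoid(l - r)
        merged.append(gate * math.tanh(l) + (1.0 - gate) * math.tanh(r))
    return merged


def tree_reduce(states: Sequence[S], merge: Callable[[S, S], S]) -> Tuple[S, int]:
    """Merge adjacent pairs level by level; return (root state, level count)."""
    if not states:
        raise ValueError("need at least one state")
    level = list(states)
    depth = 0
    while len(level) > 1:
        # Every merge in this comprehension is independent of the others,
        # so a parallel backend could evaluate one level in a single step.
        merged = [merge(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            merged.append(level[-1])  # an unpaired state moves up unchanged
        level = merged
        depth += 1
    return level[0], depth


# Tokens are embedded independently, then merged recursively.
root, depth = tree_reduce([embed(t) for t in "aabbccdd"], gated_merge)
```

Because the pairing schedule is fixed in advance, the merge function is never reordered or regrouped, which is why this sketch does not need an associative operator. Left-to-right order is preserved at every level.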
