Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling

Ivan Rodkin; Daniil Orel; Konstantin Smirnov; Arman Bolatov; Bilal Elbouardi; Besher Hassan; Yuri Kuratov; Aydar Bulatov; Preslav Nakov; Timothy Baldwin; Artem Shelmanov; Mikhail Burtsev

arXiv:2508.16745·cs.LG·May 8, 2026



TL;DR

This paper studies how neural sequence models learn and execute multi-step reasoning. It shows that model depth is crucial, and that recurrence, memory, and test-time compute extend reasoning depth only up to a point.

Contribution

The paper introduces a controlled one-dimensional cellular-automata (1dCA) framework with disjoint training and test rules, and uses it to measure the impact of model depth and the bounded benefits of recurrence, memory, and test-time compute.

Findings

01

Most neural architectures trained from scratch learn rule inference and achieve high next-step accuracy.

02

Performance drops sharply as the required number of intermediate reasoning steps increases.

03

Increasing model depth is crucial; extending effective depth through recurrence, memory, or test-time compute improves results, but the gains remain bounded.

Abstract

Reasoning is a core capability of large language models, yet how multi-step reasoning is learned and executed remains unclear. We study this question in a controlled cellular-automata (1dCA) framework that excludes memorisation by using disjoint training and test rules. Given a short state sequence, the model is required to infer the hidden local rule and then chain it to predict multiple future steps. Our evaluation shows that LLMs largely fail to reliably solve a natural-language proxy of the proposed task. We find that most neural architectures trained from scratch can learn rule inference and achieve high next-step accuracy, but performance drops sharply as the required number of intermediate reasoning steps increases. Experiments show that increasing model depth is crucial, and extending effective depth via recurrence, memory, or test-time compute improves results but remains…
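The task described in the abstract (infer a hidden local rule from a short state sequence, then chain it for several steps) can be illustrated with a small symbolic sketch. This is not the authors' code or data pipeline: binary cell states, periodic boundaries, the lookup-table solver, and every function name below are assumptions made for illustration only. The paper's models are neural networks, whereas this sketch simply reads the rule off the observed transitions.

```python
import random
from itertools import product


def random_rule(radius, rng):
    """Hypothetical helper: a random lookup table over binary neighbourhoods."""
    width = 2 * radius + 1
    return {nb: rng.randint(0, 1) for nb in product((0, 1), repeat=width)}


def neighbourhoods(state, radius):
    """Neighbourhood of every cell, assuming periodic (wrap-around) boundaries."""
    n = len(state)
    return [
        tuple(state[(i + d) % n] for d in range(-radius, radius + 1))
        for i in range(n)
    ]


def step(state, rule, radius):
    """Apply the local rule once to every cell."""
    return [rule[nb] for nb in neighbourhoods(state, radius)]


def rollout(state, rule, radius, steps):
    """Return the initial state followed by `steps` successive states."""
    states = [state]
    for _ in range(steps):
        states.append(step(states[-1], rule, radius))
    return states


def infer_rule(states, radius):
    """Recover the part of the rule table visible in consecutive states."""
    table = {}
    for prev, nxt in zip(states, states[1:]):
        for nb, out in zip(neighbourhoods(prev, radius), nxt):
            table[nb] = out
    return table


def predict(states, radius, steps):
    """Chain the inferred rule; return None if an unseen neighbourhood appears."""
    table = infer_rule(states, radius)
    state = states[-1]
    for _ in range(steps):
        nbs = neighbourhoods(state, radius)
        if any(nb not in table for nb in nbs):
            return None  # the observed sequence did not reveal this rule entry
        state = [table[nb] for nb in nbs]
    return state


if __name__ == "__main__":
    rng = random.Random(0)
    rule = random_rule(radius=2, rng=rng)            # hidden rule
    start = [rng.randint(0, 1) for _ in range(20)]   # random initial state
    observed = rollout(start, rule, radius=2, steps=10)
    print(predict(observed, radius=2, steps=3))
```

In this sketch a multi-step prediction needs the inferred table to be applied once per future step, which mirrors why the number of chained steps, rather than next-step accuracy alone, is the quantity the paper varies.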


Code & Models

Repositories

RodkinIvan/associative-recurrent-memory-transformer/tree/ACT
github

Datasets

irodkin/1dCA_r2s20T20
dataset · 60 dl
