Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence
Sean McLeish, Ang Li, John Kirchenbauer, Dayal Singh Kalra, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Jonas Geiping, Tom Goldstein, Micah Goldblum

TL;DR
This paper explores converting pretrained non-recurrent language models into depth-recurrent models using a curriculum of recurrences, leading to improved performance and reduced computational costs.
Contribution
It introduces a method to retrofit existing pretrained models into depth-recurrent models with a curriculum approach, enhancing efficiency and performance.
Findings
Recurrent models outperform non-recurrent ones at the same compute budget.
Curriculum-based training preserves performance while reducing total computational cost.
Recurrent models show improved results in mathematical tasks.
Abstract
Recent advances in depth-recurrent language models show that recurrence can decouple train-time compute and parameter count from test-time compute. In this work, we study how to convert existing pretrained non-recurrent language models into depth-recurrent models. We find that using a curriculum of recurrences to increase the effective depth of the model over the course of training preserves performance while reducing total computational cost. In our experiments, on mathematics, we observe that converting pretrained models to recurrent ones results in better performance at a given compute budget than simply post-training the original non-recurrent language model.
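One way to picture the curriculum of recurrences described in the abstract is the minimal sketch below. It is an illustration under stated assumptions, not the authors' implementation: the linear ramp, the `ramp_fraction` parameter, the function names, and the 2/4/2 split into prelude, recurrent, and coda layers are all hypothetical choices made for this example. It shows how ramping the number of recurrences during training raises effective depth gradually, and why that costs less than training at full depth from the first step.

```python
def recurrence_at_step(step: int, total_steps: int, max_recurrence: int,
                       ramp_fraction: float = 0.25) -> int:
    """Hypothetical linear curriculum: ramp from 1 recurrence up to
    max_recurrence over the first ramp_fraction of training, then hold."""
    if total_steps <= 0 or max_recurrence < 1:
        raise ValueError("total_steps must be > 0 and max_recurrence >= 1")
    ramp_steps = max(1, int(total_steps * ramp_fraction))
    progress = min(1.0, step / ramp_steps)
    return 1 + int(progress * (max_recurrence - 1))


def effective_depth(prelude: int, recurrent: int, coda: int,
                    recurrences: int) -> int:
    """Layers applied in one forward pass when the recurrent block is
    repeated `recurrences` times (layer split is an assumption)."""
    return prelude + recurrent * recurrences + coda


def relative_training_cost(total_steps: int, max_recurrence: int,
                           prelude: int = 2, recurrent: int = 4,
                           coda: int = 2,
                           ramp_fraction: float = 0.25) -> float:
    """Layer applications under the curriculum, divided by the layer
    applications when every step uses max_recurrence (a crude proxy
    for compute that ignores embeddings and the backward pass)."""
    with_curriculum = sum(
        effective_depth(prelude, recurrent, coda,
                        recurrence_at_step(s, total_steps, max_recurrence,
                                           ramp_fraction))
        for s in range(total_steps)
    )
    full_depth = total_steps * effective_depth(prelude, recurrent, coda,
                                               max_recurrence)
    return with_curriculum / full_depth


if __name__ == "__main__":
    for step in (0, 100, 250, 999):
        r = recurrence_at_step(step, total_steps=1000, max_recurrence=8)
        print(step, r, effective_depth(2, 4, 2, r))
    print(relative_training_cost(total_steps=1000, max_recurrence=8))
```

Under these assumptions the ratio returned by `relative_training_cost` is below 1 whenever `max_recurrence` exceeds 1, because early steps apply fewer layers. The actual schedule shape and layer split used in the paper may differ.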
Peer Reviews
Decision · Submitted to ICLR 2026
Nice practical study on an important problem, given the excitement around depth recurrence and test-time compute in the community. A solid set of experiments and ablations is provided, and I think this paper will be a useful reference for practitioners in the field.
- I found the terminology 'Tiny Llama' and 'Llama' confusing. It would be clearer if 'Llama' also had a prefix, to make it clear that there are two different models.
1. clear motivation (expensive training of depth-recurrent models) and a practical idea (leveraging heavily trained Llama models)
2. intuitive method for re-purposing transformer blocks from pretrained fixed-depth LLMs.
3. effective training regime strategy that notably incorporates a recurrence-scheduling curriculum, adapted from recent works.
4. significant amount of ablations (architectural configuration, layer selection, initialization, optimizer, training phases, data mixtures, etc.) that
Presentation/Paper Organization
1. The abstract is very insufficient; while concise, it lacks important details that help explain the paper.
2. The terminology used creates quite a bit of confusion. Notably, the terms "surgery", "initialize", "convert", and "retrofit" seem to be used to describe overlapping concepts. For example, "retrofit" is used to describe the method altogether, but also specifically the retraining part. It took me a long while to understand what was going on because of this.
This is clearly an empirical paper, and the authors did a reasonable job of describing and conducting the experiments. While the paper is not particularly strong on the methodological side, its main insight, that pretrained feed-forward weights can be reused for latent recurrent networks, is practically useful.
1. The paper shows that it is beneficial to initialize the weights in a latent recurrent model with pretrained feed-forward weights. However, it is not clear if this approach is overall compute optimal, i.e., FLOPS(pretrain feed-forward) + FLOPS(post-train recurrent) > FLOPS(only pretrain recurrent (maybe for longer)). Hence, we're still missing a clear compute-optimal recipe for training latent recurrent models.
2. The reasoning results show the feed-forward performances without test-time scaling
Code & Models
- 🤗 smcleish/Recurrent-Llama-3.2-train-recurrence-32 · model · 709 dl · ♡ 1
- 🤗 smcleish/Recurrent-Llama-3.2-train-recurrence-16 · model · 6 dl
- 🤗 smcleish/Recurrent-Llama-3.2-train-recurrence-8 · model · 381 dl
- 🤗 smcleish/Recurrent-Llama-3.2-train-recurrence-4 · model · 40 dl
- 🤗 smcleish/Recurrent-Llama-3.2-untrained · model · 44 dl
- 🤗 smcleish/Recurrent-Llama-3.2-2-4-2-untrained · model · 1 dl · ♡ 1
- 🤗 smcleish/Recurrent-TinyLlama-3T-train-recurrence-4 · model · 3 dl
- 🤗 smcleish/Recurrent-TinyLlama-3T-train-recurrence-8 · model · 2 dl
- 🤗 smcleish/Recurrent-TinyLlama-3T-train-recurrence-16 · model · 2 dl · ♡ 1
- 🤗 smcleish/Recurrent-TinyLlama-3T-train-recurrence-32 · model · 340 dl · ♡ 1
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
