Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence

Sean McLeish; Ang Li; John Kirchenbauer; Dayal Singh Kalra; Brian R. Bartoldson; Bhavya Kailkhura; Avi Schwarzschild; Jonas Geiping; Tom Goldstein; Micah Goldblum

arXiv:2511.07384·cs.CL·November 11, 2025

TL;DR

This paper explores converting pretrained non-recurrent language models into depth-recurrent models using a curriculum of recurrences, leading to improved performance and reduced computational costs.

Contribution

It introduces a method to retrofit existing pretrained models into depth-recurrent models with a curriculum approach, enhancing efficiency and performance.
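As a rough illustration of what "depth-recurrent" means here, the sketch below takes an ordinary stack of layers and loops one contiguous slice of it several times. Everything in it is an assumption made for illustration: the function name `make_depth_recurrent`, the `start`/`end` split points, and the toy scalar layers are hypothetical, and the paper's actual architecture and layer selection may differ.

```python
from typing import Callable, Sequence

# A "layer" here is any function from a hidden state to a hidden state.
# Real transformer blocks act on tensors; floats keep the sketch dependency-free.
Layer = Callable[[float], float]


def make_depth_recurrent(layers: Sequence[Layer], start: int, end: int):
    """Build a forward function that repeats layers[start:end] (hypothetical split)."""
    if not 0 <= start < end <= len(layers):
        raise ValueError("need 0 <= start < end <= len(layers)")
    before = list(layers[:start])      # run once, ahead of the loop
    looped = list(layers[start:end])   # reused on every recurrence
    after = list(layers[end:])         # run once, after the loop

    def forward(x: float, num_recurrences: int = 1) -> float:
        if num_recurrences < 1:
            raise ValueError("num_recurrences must be at least 1")
        for layer in before:
            x = layer(x)
        for _ in range(num_recurrences):
            for layer in looped:
                x = layer(x)
        for layer in after:
            x = layer(x)
        return x

    return forward


# Toy stand-ins for pretrained layers (assumption: not real model weights).
toy_layers = [lambda x: x + 1.0, lambda x: x * 2.0, lambda x: x - 3.0]
forward = make_depth_recurrent(toy_layers, start=1, end=2)

same_as_original = forward(1.0, num_recurrences=1)  # identical to the plain stack
deeper = forward(1.0, num_recurrences=3)            # same weights, more depth
```

With one recurrence the function reproduces the original stack, and raising `num_recurrences` increases effective depth without adding parameters, which is the sense in which recurrence separates parameter count from test-time compute.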

Findings

01. Recurrent models outperform non-recurrent ones at the same compute budget.

02. Curriculum-based training preserves performance while reducing total computational cost.

03. Recurrent models show improved results in mathematical tasks.

Abstract

Recent advances in depth-recurrent language models show that recurrence can decouple train-time compute and parameter count from test-time compute. In this work, we study how to convert existing pretrained non-recurrent language models into depth-recurrent models. We find that using a curriculum of recurrences to increase the effective depth of the model over the course of training preserves performance while reducing total computational cost. In our experiments, on mathematics, we observe that converting pretrained models to recurrent ones results in better performance at a given compute budget than simply post-training the original non-recurrent language model.
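To make the idea of a "curriculum of recurrences" concrete, here is a minimal sketch of one possible schedule: the recurrence count ramps linearly from 1 to a maximum over the first part of training, then stays fixed. The linear shape, the `ramp_fraction` parameter, and the function names are assumptions made for illustration; the abstract does not specify the schedule. The cost estimate counts only passes through the repeated block, so it is a crude proxy rather than a FLOP count.

```python
def recurrence_at_step(step: int, total_steps: int, max_recurrence: int,
                       ramp_fraction: float = 0.5) -> int:
    """Hypothetical linear curriculum: 1 recurrence at step 0, max_recurrence after the ramp."""
    if total_steps <= 0 or max_recurrence < 1 or not 0.0 < ramp_fraction <= 1.0:
        raise ValueError("invalid schedule parameters")
    ramp_steps = max(1, int(total_steps * ramp_fraction))
    clipped = min(max(step, 0), ramp_steps)
    # Integer arithmetic avoids floating-point rounding surprises.
    return 1 + clipped * (max_recurrence - 1) // ramp_steps


def relative_block_passes(total_steps: int, max_recurrence: int,
                          ramp_fraction: float = 0.5) -> float:
    """Recurrent-block passes under the curriculum, relative to always using max_recurrence."""
    with_curriculum = sum(
        recurrence_at_step(s, total_steps, max_recurrence, ramp_fraction)
        for s in range(total_steps)
    )
    return with_curriculum / (total_steps * max_recurrence)


schedule = [recurrence_at_step(s, 100, 4) for s in range(100)]
saving_ratio = relative_block_passes(100, 4)
```

In this toy setting the schedule starts at 1 recurrence, reaches 4 halfway through, and uses fewer recurrent-block passes in total than training at 4 recurrences throughout, which illustrates how a curriculum can lower training cost while still ending at full depth.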

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 3

Strengths

A nice practical study of an important problem, given the community's excitement around depth recurrence and test-time compute. A solid set of experiments and ablations is provided, and I think this paper will be a useful reference for practitioners in the field.

Weaknesses

- I found the terms 'Tiny Llama' and 'Llama' confusing. It would be clearer if 'Llama' also had a prefix, so that it is obvious there are two different models.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. clear motivation (expensive training of depth-recurrent models) and a practical idea (leveraging heavily trained Llama models)
2. intuitive method for re-purposing transformer blocks from pretrained fixed-depth LLMs.
3. effective training regime strategy that notably incorporates a recurrence-scheduling curriculum, adapted from recent works.
4. significant amount of ablations (architectural configuration, layer selection, initialization, optimizer, training phases, data mixtures, etc.) that

Weaknesses

Presentation/Paper Organization
1. The abstract is insufficient: while concise, it lacks important details that help explain the paper.
2. The terminology creates quite a bit of confusion. Notably, the terms "surgery", "initialize", "convert", and "retrofit" seem to describe overlapping concepts. For example, "retrofit" is used to describe the method as a whole, but also specifically the retraining part. It took me a long while to understand what was going on because of this.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

This is clearly an empirical paper, and the authors did a reasonable job of describing and conducting the experiments. While the paper is not particularly strong on the methodological side, its main insight, that reusing pretrained feed-forward weights for latent recurrent networks works, is practically useful.

Weaknesses

1. The paper shows that it is beneficial to initialize the weights in a latent recurrent model with pretrained feed-forward weights. However, it is not clear if this approach is overall compute optimal, i.e., FLOPS(pretrain feed-forward)+FLOPS(post-train recurrent) > FLOPS(only pretrain recurrent (maybe for longer)). Hence, we’re still missing a clear compute-optimal recipe for training latent recurrent models.
2. The reasoning results show the feed-forward performances without test-time scaling

Code & Models

Datasets

smcleish/retrofitting-llama-fineweb-edu-tokenized
dataset · 211 dl

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques