Procedural Pretraining: Warming Up Language Models with Abstract Data
Liangze Jiang, Zachary Shinnick, Anton van den Hengel, Hemanth Saratchandran, Damien Teney

TL;DR
Procedural pretraining on abstract structured data significantly enhances language models' reasoning skills, accelerates training, and improves performance across various tasks by instilling beneficial structural properties.
Contribution
This paper introduces procedural pretraining with formal language data as a novel approach to improve reasoning and efficiency in language models, distinct from standard natural language pretraining.
Findings
Pretraining on procedural data boosts algorithmic skills (e.g., context recall) dramatically.
Procedural pretraining reduces data requirements by up to 45%.
It induces structural changes in attention and MLP layers beneficial for structured tasks.
Abstract
Pretraining directly on web-scale corpora is the de facto paradigm for building language models. We study an alternative setting where the model is initially exposed to abstract structured data, as a means to ease the subsequent acquisition of rich semantic knowledge, much like humans learn simple logic and mathematics before higher reasoning. We specifically focus on procedural data, generated by formal languages and other simple algorithms, as such abstract data. We first diagnose the algorithmic skills that different forms of procedural data can improve, often significantly. For example, on context recall (Needle-in-a-haystack), the accuracy jumps from 10% to 98% when pretraining on Dyck sequences (balanced brackets). Second, we study how these gains are reflected in pretraining larger models (up to 1.3B parameters). We find that front-loading as little as 0.1% procedural data significantly…
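To make "Dyck sequences (balanced brackets)" concrete, here is a minimal Python sketch of how such procedural data could be sampled. It is an illustration only, not the authors' generator: the bracket alphabet, the `p_open` parameter, and the function names `generate_dyck` and `is_balanced` are all assumptions introduced for this example.

```python
import random

# Hypothetical bracket alphabet; the paper's actual vocabulary is not given here.
BRACKETS = {"(": ")", "[": "]", "{": "}"}


def generate_dyck(num_pairs, rng, p_open=0.5):
    """Sample a balanced bracket string containing exactly num_pairs pairs.

    At each step we either open a new bracket (if any pairs remain) or
    close the most recently opened one (if the stack is non-empty).
    """
    out = []
    stack = []
    opened = 0
    while opened < num_pairs or stack:
        can_open = opened < num_pairs
        can_close = bool(stack)
        if can_open and (not can_close or rng.random() < p_open):
            ch = rng.choice(list(BRACKETS))
            stack.append(ch)
            out.append(ch)
            opened += 1
        else:
            # Close the innermost open bracket with its matching symbol.
            out.append(BRACKETS[stack.pop()])
    return "".join(out)


def is_balanced(seq):
    """Check that seq is a well-formed Dyck word over BRACKETS."""
    stack = []
    for ch in seq:
        if ch in BRACKETS:
            stack.append(ch)
        elif not stack or BRACKETS[stack.pop()] != ch:
            return False
    return not stack


if __name__ == "__main__":
    rng = random.Random(0)
    sample = generate_dyck(8, rng)
    print(sample, is_balanced(sample))
```

Every sampled string has length `2 * num_pairs` and passes `is_balanced` by construction. Predicting each closing bracket requires tracking which bracket was opened earlier in the context, which is one plausible reason such data could exercise recall-like behavior.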
Peer Reviews
Decision: Submitted to ICLR 2026
The paper provides a well-structured and technically sound analysis of the benefits of using procedural data for pretraining. The experiments are extensive and detailed overall, and the results show clear benefits from pretraining on procedural data before pretraining on semantic data. I also appreciate the limitations the authors provide at the end of the main manuscript.
As mentioned by the authors, the paper is limited to exploring model performance at the pretraining stage (measured via perplexities) and does not further focus on downstream task performance after fine-tuning. It would be highly interesting to see how differently trained models behave and perform when properly fine-tuned, to get a better understanding of how such pretrained models generate natural language.
- Comprehensively studies the procedural data generation methods proposed in previous work, alongside the authors' own (Stack), in terms of downstream algorithmic capabilities, and shows, for example, that Dyck significantly improves long-context modeling, whereas Set is good for sorting.
- The analyses of which weights are important to transfer, and of mixing different procedural data, offer good insights into how better transfer to downstream tasks is achieved.
- The claims that procedural data improves language-domain pretraining, including math and code, have been made by previous works cited in the summary, and I believe the distinction between the claims demonstrated in this paper and the previous ones is blurred. It should be clarified what the claims of this paper are, and I still believe the authors have conducted a more in-depth, intriguing study of synthetic data for pre-pre-training.
- In Figure 5, the authors only study Union, Sor
The paper is original in reframing pretraining itself as a two-stage process that separates the acquisition of reasoning from semantic knowledge. The idea of using procedurally generated data rather than linguistic or synthetic text as a lightweight reasoning curriculum is both conceptually elegant and practically effective. The empirical evidence is strong and comprehensive, covering both diagnostic algorithmic tasks and real-world domains like code, math, and language, with clear methodology a
While the results are compelling, the experiments are limited to relatively small GPT-2–scale models, leaving open how well these effects hold at modern pretraining scales. The analysis focuses on loss and perplexity rather than deeper reasoning or interpretability metrics, so it’s unclear whether the gains reflect genuine reasoning improvements or better optimization. The mixture and weight-composition results, though promising, remain proof-of-concept and could benefit from a more systematic e
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning in Materials Science
