HRM-Text: Efficient Pretraining Beyond Scaling
Guan Wang, Changling Liu, Chenyu Wang, Cai Zhou, Yuhao Sun, Yifei Wu, Shuai Zhen, Luca Scimeca, Yasin Abbasi Yadkori

TL;DR
HRM-Text introduces a biologically inspired hierarchical recurrent model for language pretraining, achieving competitive performance with significantly less data and compute, thus lowering barriers for foundational research.
Contribution
The paper presents HRM-Text, a hierarchical recurrent architecture for language modeling, together with two techniques for stabilizing its deep recurrence (MagicNorm and warmup deep credit assignment), enabling efficient pretraining on limited data and compute.
Findings
Achieves 60.7% on MMLU after training on only 40 billion tokens
Uses 100-900x fewer tokens and 96-432x less compute than standard models
Performs competitively with larger open models (2-7B parameters)
Abstract
The current pretraining paradigm for large language models relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems demonstrate highly sample-efficient learning through multi-timescale processing, such as the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces standard Transformers with a Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic and fast-evolving execution layers. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. Serving as an empirical existence proof of efficient…
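To make the PrefixLM masking and task-completion objective mentioned in the abstract more concrete, here is a minimal sketch in plain Python. It follows the common definition of PrefixLM (the instruction prefix attends bidirectionally to itself, the response attends causally) and assumes that the loss is computed on response tokens only. The function names `prefix_lm_mask` and `response_loss_mask` are hypothetical and are not taken from the paper; the authors' actual implementation may differ.

```python
def prefix_lm_mask(seq_len, prefix_len):
    """Return a boolean matrix where mask[i][j] is True if position i may attend to j.

    Assumption: the first `prefix_len` positions hold the instruction and see
    each other bidirectionally; the remaining positions hold the response and
    attend causally to everything before them (and to themselves).
    """
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must be between 0 and seq_len")
    return [
        [j <= i or (i < prefix_len and j < prefix_len) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def response_loss_mask(seq_len, prefix_len):
    """Return a boolean list marking the positions that contribute to the loss.

    Assumption: under a task-completion objective, only response tokens
    (positions at or after `prefix_len`) are scored.
    """
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must be between 0 and seq_len")
    return [i >= prefix_len for i in range(seq_len)]


if __name__ == "__main__":
    # Example: a 2-token instruction followed by a 3-token response.
    for row in prefix_lm_mask(5, 2):
        print("".join("1" if allowed else "." for allowed in row))
    print(response_loss_mask(5, 2))
```

With `prefix_len=0` the attention mask reduces to an ordinary causal mask, which is one way to see PrefixLM as a generalization of standard left-to-right language modeling.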
