Hierarchical vs. Flat Iteration in Shared-Weight Transformers
Sang-Il Han

TL;DR
This study compares hierarchical shared-weight recurrence with independent-layer stacking in Transformer language models and reports a sharp empirical gap between the two approaches.
Contribution
It introduces a two-speed recurrent hierarchy model (HRM-LM) and empirically compares it with independent-layer Transformer stacking.
Findings
A sharp empirical gap separates shared-weight recurrence from independent-layer stacking.
The study tests whether the recurrent hierarchy can match the representational quality of independent-layer stacking in a language model.
A parameter-matched Universal Transformer ablation (UniTF, 1.2B) across five independent runs supports the central finding.
Abstract
We present an empirical study of whether hierarchically structured, shared-weight recurrence can match the representational quality of independent-layer stacking in a Transformer-based language model. HRM-LM replaces L independent Transformer layers with a two-speed recurrent pair: a Fast module operating at every step for local refinement, and a Slow module operating every T steps for global compression. This recurrent hierarchy is unrolled for M = N × T steps with shared parameters. The central and most robust finding, supported by a parameter-matched Universal Transformer ablation (UniTF, 1.2B) across five independent runs, is a sharp empirical gap between the two approaches.
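The two-speed schedule described in the abstract can be sketched as a short unrolling loop. This is only an illustrative sketch, not the paper's implementation: the function names (`unroll_two_speed`, `toy_fast`, `toy_slow`), the zero initialization of the slow state, and the way the Fast module reads the Slow state are all assumptions, and the toy update rules stand in for real Transformer blocks. What it shows is the schedule itself: the Fast module runs at every step, the Slow module runs once every T steps, and the same two modules are reused for all M = N × T steps.

```python
from typing import Callable, List, Tuple

State = List[float]


def unroll_two_speed(
    x: State,
    fast: Callable[[State, State, State], State],
    slow: Callable[[State, State], State],
    n_cycles: int,
    period: int,
) -> Tuple[State, State, List[str]]:
    """Unroll a Fast/Slow pair for M = n_cycles * period steps.

    The same `fast` and `slow` callables are reused at every step,
    which is what "shared parameters" means in this sketch.
    """
    z_fast = list(x)            # assumption: fast state starts at the input
    z_slow = [0.0] * len(x)     # assumption: slow state starts at zero
    schedule: List[str] = []    # records which module ran, in order
    total_steps = n_cycles * period
    for step in range(1, total_steps + 1):
        # Fast module: applied at every step (local refinement).
        z_fast = fast(z_fast, z_slow, x)
        schedule.append("fast")
        # Slow module: applied once every `period` steps (global compression).
        if step % period == 0:
            z_slow = slow(z_slow, z_fast)
            schedule.append("slow")
    return z_fast, z_slow, schedule


def toy_fast(z_fast: State, z_slow: State, x: State) -> State:
    # Hypothetical stand-in for a Transformer block: mixes the fast state,
    # the slow state, and the input.
    return [0.5 * f + 0.25 * s + 0.25 * xi for f, s, xi in zip(z_fast, z_slow, x)]


def toy_slow(z_slow: State, z_fast: State) -> State:
    # Hypothetical stand-in: averages the slow state with the current fast state.
    return [0.5 * s + 0.5 * f for s, f in zip(z_slow, z_fast)]


# Usage: N = 2 slow cycles, T = 3 fast steps per cycle, so M = 6 steps.
z_fast, z_slow, schedule = unroll_two_speed(
    [1.0, 2.0], toy_fast, toy_slow, n_cycles=2, period=3
)
print(schedule)
```

With N = 2 and T = 3 the recorded schedule contains six Fast updates and two Slow updates, so effective depth grows with M while the number of distinct modules stays at two.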
