Hierarchical vs. Flat Iteration in Shared-Weight Transformers

Sang-Il Han

arXiv:2604.14442 · cs.CL · April 17, 2026


TL;DR

This study compares hierarchical shared-weight recurrence with independent-layer stacking in Transformer language models and reports a sharp empirical gap in representational quality between the two approaches.

Contribution

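It introduces HRM-LM, a two-speed recurrent hierarchy that replaces L independent Transformer layers with a shared-weight Fast/Slow module pair, and compares it empirically with independent-layer stacking and a parameter-matched Universal Transformer.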

Findings

01

Shared-weight recurrence underperforms independent-layer stacking.

02

The gap between the recurrent hierarchy and independent-layer stacking shows up in the representational quality of the language model.

03

A parameter-matched Universal Transformer ablation (UniTF, 1.2B) across five independent runs supports the central finding.

Abstract

We present an empirical study of whether hierarchically structured, shared-weight recurrence can match the representational quality of independent-layer stacking in a Transformer-based language model. HRM-LM replaces L independent Transformer layers with a two-speed recurrent pair: a Fast module operating at every step for local refinement, and a Slow module operating every T steps for global compression. This recurrent hierarchy is unrolled for M = N × T steps with shared parameters. The central and most robust finding, supported by a parameter-matched Universal Transformer ablation (UniTF, 1.2B) across five independent runs, is a sharp empirical gap between the two approaches.
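
The two-speed schedule described in the abstract can be illustrated with a minimal Python sketch. It shows only the unrolling pattern: one Fast update at every step, one Slow update every T steps, for M = N × T steps in total, with the same two callables reused throughout (standing in for shared parameters). The function name `unroll_two_speed`, the state-passing convention between the two modules, the zero-initialized slow state, and the toy arithmetic inside `toy_fast` and `toy_slow` are all assumptions made for illustration; the abstract does not specify them, and the real modules would be Transformer blocks.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def unroll_two_speed(
    x: Vector,
    fast: Callable[[Vector, Vector, Vector], Vector],
    slow: Callable[[Vector, Vector], Vector],
    n_cycles: int,
    period: int,
) -> Tuple[Vector, Vector, List[str]]:
    """Unroll a Fast/Slow pair for M = n_cycles * period steps.

    The same `fast` and `slow` callables are reused at every step,
    which plays the role of weight sharing across the unrolled depth.
    How the two states feed each other is an assumption of this sketch.
    """
    z_fast = list(x)
    z_slow = [0.0] * len(x)  # assumed initialization
    trace: List[str] = []
    total_steps = n_cycles * period  # M = N x T
    for step in range(1, total_steps + 1):
        # Fast module: runs at every step (local refinement).
        z_fast = fast(z_fast, z_slow, x)
        trace.append("F")
        # Slow module: runs once every `period` steps (global compression).
        if step % period == 0:
            z_slow = slow(z_slow, z_fast)
            trace.append("S")
    return z_fast, z_slow, trace


def toy_fast(z_fast: Vector, z_slow: Vector, x: Vector) -> Vector:
    # Hypothetical stand-in for a shared Transformer block.
    return [0.5 * (f + s) + xi for f, s, xi in zip(z_fast, z_slow, x)]


def toy_slow(z_slow: Vector, z_fast: Vector) -> Vector:
    # Hypothetical stand-in for the slow, compressing block.
    return [0.9 * s + 0.1 * f for s, f in zip(z_slow, z_fast)]


# N = 2 cycles of T = 3 steps gives M = 6 Fast updates and 2 Slow updates.
z_fast, z_slow, trace = unroll_two_speed([1.0, -1.0], toy_fast, toy_slow, n_cycles=2, period=3)
print("".join(trace))
```

Under these assumptions, the parameter count stays fixed at two modules regardless of N and T, whereas a stack of L independent layers grows its parameter count with L. That difference is why the comparison in the abstract is made against a parameter-matched baseline.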
