TL;DR
This paper measures how much a shared recurrence in a looped transformer is worth relative to a unique block, comparing validation loss at matched training compute and introducing a recurrence-equivalence exponent φ as the diagnostic.
Contribution
It introduces a joint scaling law and a recurrence-equivalence exponent φ that quantify how much capacity a shared recurrence contributes compared with a unique block.
Findings
The recurrence-equivalence exponent φ is approximately 0.46, indicating partial equivalence between shared recurrences and unique blocks.
Replacing unique blocks with shared recurrences increases validation loss at the same training compute.
Truncated backpropagation lowers φ, showing poorer training of the loop mechanism, while hyperconnections raise φ, indicating capacity gains.
Abstract
We measure how much one recurrence is worth to a looped (depth-recurrent) transformer, in equivalent unique parameters. From an iso-depth pretraining sweep across recurrence counts spanning […] in training compute, we fit a joint scaling law and measure a recurrence-equivalence exponent φ ≈ 0.46. Intuitively, φ tells us whether looping a block r times is equivalent in validation loss to r unique blocks of a non-looped model (full equivalence, φ = 1) or to a single block run repeatedly with no capacity gain (φ = 0). Our φ sits in between, so replacing unique blocks with shared recurrences increases validation loss at matched training compute. For example, at […] a 410M looped model performs on par with a 580M…
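To make the exponent concrete, here is a minimal sketch of how φ could translate a recurrence count into "equivalent unique blocks". It assumes the equivalence takes the power-law form r^φ, which is consistent with the two endpoints described in the abstract (r unique blocks at full equivalence, a single block when looping adds no capacity) but is not confirmed by this summary as the paper's exact parameterization. The function name `equivalent_unique_blocks` and the recurrence counts in the loop are hypothetical, chosen for illustration only.

```python
def equivalent_unique_blocks(r: int, phi: float) -> float:
    """Unique-block equivalents of one shared block looped r times.

    ASSUMPTION: the equivalence is modeled as r ** phi. This form is an
    illustrative guess, not taken from the paper's fitted scaling law.
    """
    if r < 1:
        raise ValueError("recurrence count r must be at least 1")
    return float(r) ** phi


# Approximate exponent reported in the Findings section.
PHI = 0.46

if __name__ == "__main__":
    # Hypothetical recurrence counts, not the paper's sweep.
    for r in (1, 2, 4, 8):
        blocks = equivalent_unique_blocks(r, PHI)
        print(f"r={r}: worth about {blocks:.2f} unique blocks")
```

Under this assumed form, a block looped 4 times with φ ≈ 0.46 would be worth roughly 1.9 unique blocks rather than 4, which is one way to read the finding that shared recurrences underperform unique blocks at matched training compute.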
