TL;DR
Parcae is a stable looped architecture for language models. By constraining the spectral norm of its injection parameters, it avoids the training instability of earlier looped models, scales more efficiently, and reaches lower perplexity under a fixed parameter and data budget.
Contribution
The paper proposes Parcae, a stable looped architecture that addresses the residual explosion and loss spikes seen in existing looped training recipes, and provides predictable scaling laws for training and test-time FLOPs.
Findings
Parcae achieves up to 6.3% lower validation perplexity than prior models.
Scaling laws suggest increasing FLOPs and data in tandem improves model quality.
Parcae reaches up to 87.5% of the quality of a Transformer baseline twice its size.
Abstract
Traditional fixed-depth architectures scale quality by increasing training FLOPs, typically through increased parameterization, at the expense of a higher memory footprint, or data. A potential alternative is looped architectures, which instead increase FLOPs by sending activations through a block of layers in a loop. While promising, existing recipes for training looped architectures can be unstable, suffering from residual explosion and loss spikes. We address these challenges by recasting looping as a nonlinear time-variant dynamical system over the residual stream. Via a linear approximation to this system, we find that instability occurs in existing looped architectures as a result of large spectral norms in their injection parameters. To address these instability issues, we propose Parcae, a novel stable, looped architecture that constrains the spectral norm of the injection…
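The link between spectral norm and residual explosion described in the abstract can be illustrated with a toy linear recurrence. The sketch below is not the paper's parameterization: the update h <- A h + b, the scaled-rotation matrix A, the fixed input b, and the function name looped_norms are all assumptions chosen so that the spectral norm of A is exactly rho and the example needs only the standard library.

```python
import math

def looped_norms(rho, theta=0.7, steps=50):
    """Iterate h <- A h + b and record the norm of h after each loop.

    A = rho * rotation(theta) is a hypothetical stand-in for the linearized
    loop update; its spectral norm is exactly rho.
    """
    c, s = math.cos(theta), math.sin(theta)
    a = [[rho * c, -rho * s],
         [rho * s,  rho * c]]
    b = (1.0, 0.0)   # hypothetical injected input, re-added on every loop
    h = (0.0, 0.0)   # residual state before the first loop
    norms = []
    for _ in range(steps):
        h = (a[0][0] * h[0] + a[0][1] * h[1] + b[0],
             a[1][0] * h[0] + a[1][1] * h[1] + b[1])
        norms.append(math.hypot(h[0], h[1]))
    return norms

stable = looped_norms(rho=0.9)    # spectral norm below 1
unstable = looped_norms(rho=1.1)  # spectral norm above 1

print("max norm, rho=0.9:", max(stable))
print("final norm, rho=1.1:", unstable[-1])
```

With rho below 1 the state norm stays under the geometric-series bound ||b|| / (1 - rho), however many loops are run. With rho above 1 it grows geometrically with the loop count, which is the toy analogue of residual explosion.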
Code & Models
- 🤗 SandyResearch/parcae-140m (model) · 2.7k dl · ♡ 3
- 🤗 SandyResearch/parcae-370m (model) · 3.4k dl · ♡ 2
- 🤗 SandyResearch/parcae-770m (model) · 2.7k dl · ♡ 4
- 🤗 SandyResearch/parcae-1.3b (model) · 2.8k dl · ♡ 6
- 🤗 trojan0x/ultron (model)
- 🤗 trojan0x/ultron-small-baseline (model) · 8 dl · ♡ 1
- 🤗 maidacundo/open-mythos-hf (model)
- 🤗 Lgr54HFi/chimera (model) · 52 dl
- 🤗 Lgr54HFi/ch1mera (model) · 149 dl
- 🤗 Lgr54HFi/chomera (model) · 342 dl
