Unified Scaling Laws for Routed Language Models
Aidan Clark, Diego de las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, George van den Driessche, Eliza Rutherford, Tom Hennigan, Matthew Johnson, Katie Millican, Albin Cassirer, Chris Jones

TL;DR
This paper establishes unified scaling laws for Routing Networks, showing how their performance depends on both parameter count and computational effort, and compares different routing techniques across a wide range of model sizes.
Contribution
It introduces generalized scaling laws for Routing Networks that account for both parameters and computation, and applies these laws to compare routing techniques quantitatively.
Findings
Scaling laws accurately model Routing Network performance.
An effective parameter count aligns different models' scaling behaviors.
Routing techniques differ significantly in their scaling coefficients.
Abstract
The performance of a language model has been shown to be effectively modeled as a power-law in its parameter count. Here we study the scaling behaviors of Routing Networks: architectures that conditionally use only a subset of their parameters while processing an input. For these models, parameter count and computational requirement form two independent axes along which an increase leads to better performance. In this work we derive and justify scaling laws defined on these two variables which generalize those known for standard language models and describe the performance of a wide range of routing architectures trained via three different techniques. Afterwards we provide two applications of these laws: first deriving an Effective Parameter Count along which all models scale at the same rate, and then using the scaling coefficients to give a quantitative comparison of the three…
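As a rough illustration of the starting point the abstract mentions (loss modeled as a power law in parameter count), the sketch below fits loss ≈ a · N^(-alpha) by ordinary least squares in log-log space. It does not reproduce the paper's two-variable laws for routed models. The function name `fit_power_law`, the coefficients, and the data points are all hypothetical and synthetic, not results from the paper.

```python
import math


def fit_power_law(param_counts, losses):
    """Fit loss ~= a * N**(-alpha) via least squares on (log N, log loss).

    Hypothetical helper for illustration; returns the pair (a, alpha).
    """
    xs = [math.log(n) for n in param_counts]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Closed-form simple linear regression: slope = cov(x, y) / var(x).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log(loss) = log(a) - alpha * log(N), so a = exp(intercept), alpha = -slope.
    return math.exp(intercept), -slope


# Synthetic, noise-free data generated from assumed (made-up) coefficients.
true_a, true_alpha = 20.0, 0.08
sizes = [1e7, 1e8, 1e9, 1e10]
synthetic_losses = [true_a * n ** (-true_alpha) for n in sizes]

a_hat, alpha_hat = fit_power_law(sizes, synthetic_losses)
print(f"a = {a_hat:.4f}, alpha = {alpha_hat:.4f}")
```

Because the synthetic data is noise-free, the fit recovers the assumed coefficients up to floating-point error. Real loss measurements would be noisy, and a law over both parameter count and computation would need a second input variable.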
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks
