Parallel Scaling Law for Language Models
Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, Zhongxin Liu

TL;DR
This paper introduces parallel scaling (ParScale), a new inference-efficient method that increases model computation through parallel transformations, enabling similar or better performance with less memory and latency than traditional parameter scaling.
Contribution
The paper proposes ParScale, a novel parallel computation paradigm for language models, along with a new scaling law validated through large-scale experiments.
Findings
ParScale achieves comparable performance to parameter scaling with less memory and latency.
A new theoretical scaling law relates parallel streams to effective parameter scaling.
ParScale can convert pre-trained models into parallel versions with minimal additional training.
Abstract
It is commonly believed that scaling language models must incur a significant space or time cost, by increasing the parameters (parameter scaling) or output tokens (inference-time scaling). We introduce the third and more inference-efficient scaling paradigm: increasing the model's parallel computation during both training and inference time. We apply P diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the P outputs. This method, namely parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with P parallel streams is similar to scaling the parameters by O(log P) while…
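The three steps named in the abstract (transform the input per stream, run the same model on every stream, aggregate dynamically) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `base_model` is a toy one-layer network standing in for a language model, the per-stream additive shift stands in for the learnable input transformation, and the linear scoring vector `agg_w` stands in for the learned aggregation. All of these names and choices are hypothetical and are not taken from the paper's implementation.

```python
import math
import random


def base_model(x, w, b):
    """Toy stand-in for the shared network: one dense layer with tanh."""
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + bj)
            for row, bj in zip(w, b)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def parscale_forward(x, w, b, stream_shifts, agg_w):
    """Run one input through len(stream_shifts) streams that share (w, b)."""
    # 1) Per-stream input transformation (assumption: a learnable additive shift).
    inputs = [[xi + si for xi, si in zip(x, shift)] for shift in stream_shifts]
    # 2) The same parameters are reused for every stream. The loop is sequential
    #    here; a real system would batch the streams so they run in parallel.
    outputs = [base_model(inp, w, b) for inp in inputs]
    # 3) Dynamic aggregation (assumption: softmax over a linear score per stream).
    scores = [sum(a * o for a, o in zip(agg_w, out)) for out in outputs]
    weights = softmax(scores)
    combined = [sum(wt * out[j] for wt, out in zip(weights, outputs))
                for j in range(len(outputs[0]))]
    return combined, weights


# Usage: P = 4 streams over a 4-dimensional input and a 3-dimensional output.
random.seed(0)
d_in, d_out, num_streams = 4, 3, 4
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
b = [0.0] * d_out
shifts = [[random.gauss(0, 0.5) for _ in range(d_in)] for _ in range(num_streams)]
agg_w = [random.gauss(0, 1) for _ in range(d_out)]
x = [0.1, -0.2, 0.3, 0.4]
y, weights = parscale_forward(x, w, b, shifts, agg_w)
```

In this sketch the only stream-specific state is `shifts` and `agg_w`, so the parameter count grows very little as the number of streams rises, while the compute grows with it. With a single stream and a zero shift, the function reduces to a plain call of `base_model`.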
Code & Models
- 🤗 ParScale/ParScale-0.7B-P1-Pile · model · 1 dl
- 🤗 ParScale/ParScale-0.7B-P1-Python · model · 4 dl
- 🤗 ParScale/ParScale-0.7B-P2-Pile · model · 1 dl
- 🤗 ParScale/ParScale-0.7B-P2-Python · model · 2 dl
- 🤗 ParScale/ParScale-0.7B-P4-Pile · model · 2 dl
- 🤗 ParScale/ParScale-0.7B-P4-Python · model · 2 dl
- 🤗 ParScale/ParScale-0.7B-P8-Pile · model · 1 dl
- 🤗 ParScale/ParScale-0.7B-P8-Python · model · 1 dl
- 🤗 ParScale/ParScale-0.9B-P1-Python · model · 1 dl
- 🤗 ParScale/ParScale-0.9B-P2-Python · model · 1 dl
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
