Scaling Probabilistic Transformer via Efficient Cross-Scale Hyperparameter Transfer
Penghao Kuang, Haoyi Wu, Kewei Tu

TL;DR
This paper introduces a method for scaling Probabilistic Transformers efficiently: hyperparameters tuned on small models are transferred to larger ones, so larger models reach better performance without extra tuning.
Contribution
We apply Maximal Update Parametrization to Probabilistic Transformers, allowing hyperparameter transfer from small to large models, and demonstrate improved performance at larger scales.
Findings
PT outperforms standard Transformers at the same parameter budget on Masked Language Modeling (MLM) tasks.
PT scales successfully to models with up to 0.4B parameters.
Hyperparameter transfer enables efficient scaling without additional tuning.
Abstract
Probabilistic Transformer (PT), a white-box probabilistic model for contextual word representation, has demonstrated substantial similarity to standard Transformers in both computational structure and downstream task performance on small models and small- to medium-sized datasets. However, PT is less robust to hyperparameter choices than standard Transformers, making it harder to scale efficiently. In this work, we follow Maximal Update Parametrization (muP) to rescale PT's parameters, so that hyperparameters optimized on small models can be transferred to larger models without additional tuning. With this approach, we successfully scale PT to models with up to 0.4B parameters. Experiments show that PT consistently outperforms standard Transformers under the same parameter budget on Masked Language Modeling (MLM) tasks. We hope this work will contribute to the practical deployment of…
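As a rough illustration of what "transferring hyperparameters across scales" means, the sketch below applies the width-scaling rules commonly associated with muP for standard Transformers trained with Adam: hidden-weight learning rate shrinks in proportion to the width multiplier, hidden-weight initialization standard deviation shrinks with its square root, and output logits are scaled down by the multiplier. These rules are an assumption made for illustration only. The abstract does not state PT's actual rescaling rules, and the names `mup_scale` and `MuPScaledHParams` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MuPScaledHParams:
    """Hypothetical container for width-adjusted hyperparameters."""
    hidden_lr: float          # learning rate for hidden weight matrices
    hidden_init_std: float    # init std for hidden weight matrices
    output_multiplier: float  # factor applied to the output logits


def mup_scale(base_lr: float, base_init_std: float,
              base_width: int, target_width: int) -> MuPScaledHParams:
    """Rescale hyperparameters tuned at base_width for use at target_width.

    Assumes the muP rules usually quoted for Adam-trained Transformers;
    these are NOT confirmed to be the rules used for PT.
    """
    if base_width <= 0 or target_width <= 0:
        raise ValueError("widths must be positive")
    m = target_width / base_width  # width multiplier
    return MuPScaledHParams(
        hidden_lr=base_lr / m,                     # lr scales as 1/width
        hidden_init_std=base_init_std / m ** 0.5,  # std scales as 1/sqrt(width)
        output_multiplier=1.0 / m,                 # logits scale as 1/width
    )


# Values tuned on a small proxy model (width 256), reused at width 1024.
scaled = mup_scale(base_lr=1e-3, base_init_std=0.02,
                   base_width=256, target_width=1024)
print(scaled)
```

The point of the design is that only the small proxy model is tuned; the larger model's settings follow from the width ratio rather than from another sweep.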
