Understanding Parameter Sharing in Transformers
Ye Lin, Mingxuan Wang, Zhexi Zhang, Xiaohui Wang, Tong Xiao, Jingbo Zhu

TL;DR
This paper investigates why parameter sharing in Transformers improves performance. It finds that better training convergence, rather than increased model complexity, largely explains the gains, and uses that finding to build more efficient models.
Contribution
The study reveals that improved convergence explains the effectiveness of parameter sharing in Transformers, and proposes hyperparameter tuning to enhance efficiency.
Findings
Parameter sharing helps mainly by improving training convergence, not merely by increasing model depth.
Shared-parameter models achieve competitive results with half the complexity.
Hyperparameter tuning further enhances shared-parameter model performance.
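To make "depth" and "complexity" in the findings above concrete, the following is a rough sketch of how cross-layer sharing decouples parameter count from compute. It assumes complexity is measured in FLOPs, counts only the attention projections and feed-forward matrices of a standard Transformer layer, and uses the common approximation of about 2 FLOPs per weight per token. The function names and the sizes (`d_model=512`, `d_ff=2048`, 6 unique layers) are illustrative choices, not values taken from the paper.

```python
def layer_params(d_model: int, d_ff: int) -> int:
    """Approximate weights in one Transformer layer.

    Counts the four attention projections (Q, K, V, output) and the two
    feed-forward matrices. Biases and LayerNorm are ignored (assumption).
    """
    return 4 * d_model * d_model + 2 * d_model * d_ff


def stack_cost(unique_layers: int, repeats: int,
               d_model: int = 512, d_ff: int = 2048) -> tuple[int, int]:
    """Return (parameter count, rough FLOPs per token) for a layer stack.

    Each of the `unique_layers` weight sets is applied `repeats` times,
    so the effective depth is unique_layers * repeats.
    """
    per_layer = layer_params(d_model, d_ff)
    params = unique_layers * per_layer        # only unique weights are stored
    depth = unique_layers * repeats           # every application costs compute
    flops_per_token = 2 * per_layer * depth   # ~2 FLOPs per weight (assumption)
    return params, flops_per_token


baseline = stack_cost(unique_layers=6, repeats=1)  # 6 distinct layers
shared = stack_cost(unique_layers=6, repeats=2)    # same weights, depth 12

print("baseline params / FLOPs per token:", baseline)
print("shared   params / FLOPs per token:", shared)
```

Under these assumptions, reusing each layer twice leaves the parameter count unchanged while roughly doubling per-token compute, which is why a model that matches the shared model's quality at lower depth can be cheaper to run.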
Abstract
Parameter sharing has proven to be a parameter-efficient approach. Previous work on Transformers has focused on sharing parameters in different layers, which can improve the performance of models with limited parameters by increasing model depth. In this paper, we study why this approach works from two perspectives. First, increasing model depth makes the model more complex, and we hypothesize that the reason is related to model complexity (referring to FLOPs). Secondly, since each shared parameter will participate in the network computation several times in forward propagation, its corresponding gradient will have a different range of values from the original model, which will affect the model convergence. Based on this, we hypothesize that training convergence may also be one of the reasons. Through further analysis, we show that the success of this approach can be largely attributed…
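The convergence hypothesis in the abstract rests on the fact that a shared parameter is used several times in the forward pass, so its gradient is the sum of the gradients from every use. The following is a toy sketch of that effect, assuming a chain of scalar linear "layers" with no nonlinearity. It is not the paper's experimental setup, and the names `forward` and `per_layer_grads` are hypothetical helpers for illustration only.

```python
def forward(ws: list[float], x: float) -> list[float]:
    """Apply scalar layers in order; return the input plus every activation."""
    acts = [x]
    for w in ws:
        acts.append(w * acts[-1])
    return acts


def per_layer_grads(ws: list[float], x: float) -> list[float]:
    """Gradient of the final output with respect to each layer's weight,
    treating every weight as an independent parameter (manual backprop)."""
    acts = forward(ws, x)
    grads = []
    upstream = 1.0  # d(output) / d(output of layer i)
    for i in reversed(range(len(ws))):
        grads.append(upstream * acts[i])  # d(output)/d(w_i) = upstream * layer input
        upstream *= ws[i]                 # propagate through layer i
    return list(reversed(grads))


w, x, depth = 0.9, 2.0, 4

# Evaluate the per-position gradients with all weights set to the same value.
grads = per_layer_grads([w] * depth, x)

# With sharing, the single parameter w receives the sum of all of them.
shared_grad = sum(grads)

# Analytic check: output = w**depth * x, so d(output)/dw = depth * w**(depth-1) * x.
analytic = depth * w ** (depth - 1) * x

print("per-use gradients:", grads)
print("shared gradient:  ", shared_grad)
print("analytic gradient:", analytic)
```

In this toy setting the shared weight's gradient is `depth` times larger than any single layer's gradient, which illustrates why the gradient range differs from the unshared model and why training hyperparameters tuned for one may not suit the other.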
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
