Lessons on Parameter Sharing across Layers in Transformers
Sho Takase, Shun Kiyono

TL;DR
This paper introduces three strategies for sharing parameters across Transformer layers. They are efficient in both parameter size and computational time, and remain effective when trained on large datasets.
Contribution
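It proposes three parameter sharing strategies (Sequence, Cycle, and Cycle (rev)) for assigning parameters to each layer. They relax the widely used technique of sharing one layer's parameters with all layers, as in Universal Transformers, to improve efficiency in computational time.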
Findings
The proposed strategies are efficient in both parameter size and computational time.
They remain effective when a large amount of training data is used, as in the recent WMT competition.
They maintain performance with fewer parameters.
Abstract
We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes a widely used technique, which shares parameters for one layer with all layers such as Universal Transformers (Dehghani et al., 2019), to increase the efficiency in the computational time. We propose three strategies: Sequence, Cycle, and Cycle (rev) to assign parameters to each layer. Experimental results show that the proposed strategies are efficient in the parameter size and computational time. Moreover, we indicate that the proposed strategies are also effective in the configuration where we use many training data such as the recent WMT competition.
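The abstract names the three strategies but does not spell out how each one maps layers to parameters. The sketch below shows one plausible reading, assuming N layers draw from M distinct parameter sets: Sequence gives consecutive layers the same set, Cycle repeats the M sets in order, and Cycle (rev) does the same but stacks the final M layers in reverse order. The function name `assign_layers`, the strategy strings, and the requirement that N be divisible by M are illustrative simplifications, not details taken from the paper.

```python
def assign_layers(num_layers: int, num_param_sets: int, strategy: str) -> list[int]:
    """Return, for each layer (0-indexed), the index of the parameter set it uses.

    Hypothetical helper: the mapping rules are an assumed reading of the
    Sequence / Cycle / Cycle (rev) strategies named in the abstract.
    """
    n, m = num_layers, num_param_sets
    if m < 1 or n < m or n % m != 0:
        # Simplifying assumption for this sketch: N is a positive multiple of M.
        raise ValueError("expected 1 <= M <= N with N divisible by M")

    if strategy == "sequence":
        # Consecutive blocks of N/M layers share one parameter set.
        block = n // m
        return [i // block for i in range(n)]
    if strategy == "cycle":
        # The M parameter sets are stacked in order, then repeated.
        return [i % m for i in range(n)]
    if strategy == "cycle_rev":
        # Same as Cycle, except the last M layers use the sets in reverse order.
        head = n - m
        return [i % m if i < head else m - 1 - (i % m) for i in range(n)]
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    for name in ("sequence", "cycle", "cycle_rev"):
        print(name, assign_layers(6, 3, name))
```

Under these assumptions, setting M = 1 makes every layer reuse the same parameter set, which corresponds to the all-layer sharing that the abstract says the proposed approach relaxes, while M = N gives each layer its own parameters.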
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
