Lessons on Parameter Sharing across Layers in Transformers

Sho Takase; Shun Kiyono

arXiv:2104.06022 · cs.CL · June 5, 2023 · 24 cites

TL;DR

This paper introduces flexible parameter sharing strategies for Transformer models, improving efficiency in computational time and parameter size while maintaining effectiveness on large datasets.

Contribution

It proposes three parameter sharing strategies, namely Sequence, Cycle, and Cycle (rev), that relax the common practice of sharing one layer's parameters across all layers in order to improve computational efficiency.

Findings

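01 The proposed strategies are efficient in both parameter size and computational time.

02 They remain effective when a large amount of training data is available, as in the recent WMT competition.

03 They maintain performance with fewer parameters.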
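
To make the parameter-size finding concrete, the following is a rough sketch that counts the unique parameters in a stack of encoder layers when only M distinct parameter sets are shared across N layers. The function names `layer_params` and `stack_params` are hypothetical, and the default dimensions (512 and 2048) are the Transformer-base sizes from Vaswani et al. (2017), not values taken from this page. Embeddings and decoder cross-attention are left out for simplicity.

```python
def layer_params(d_model: int, d_ff: int) -> int:
    """Weights and biases of one encoder layer (hypothetical helper).

    Counts self-attention, the feed-forward block, and two layer norms.
    """
    # Query, key, value, and output projections, each d_model x d_model plus bias.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear maps: d_model -> d_ff and d_ff -> d_model, each with bias.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms, each with a gain and a bias vector.
    layer_norms = 2 * 2 * d_model
    return attention + feed_forward + layer_norms


def stack_params(m: int, d_model: int = 512, d_ff: int = 2048) -> int:
    """Unique parameters in a stack that shares m parameter sets.

    The depth N does not appear: stacking more layers that reuse the same
    m parameter sets adds no new parameters.
    """
    return m * layer_params(d_model, d_ff)


if __name__ == "__main__":
    print("unshared 6-layer stack:", stack_params(6))
    print("6 layers sharing 3 sets:", stack_params(3))
```

Under these assumptions the parameter count depends only on M, while the forward pass still applies all N layers.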

Abstract

We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes a widely used technique that shares the parameters of one layer across all layers, as in Universal Transformers (Dehghani et al., 2019), in order to improve computational efficiency. We propose three strategies for assigning parameters to each layer: Sequence, Cycle, and Cycle (rev). Experimental results show that the proposed strategies are efficient in both parameter size and computational time. Moreover, we show that the proposed strategies are also effective in configurations with a large amount of training data, such as the recent WMT competition.
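
The abstract names the three strategies without defining them. The sketch below shows one reading of how they might map N layers onto M shared parameter sets, based on the full paper as recalled rather than on anything stated on this page, so treat the details as assumptions. Sequence gives each parameter set to a run of consecutive layers, Cycle repeats the M sets in order, and Cycle (rev) does the same but uses reverse order for the final M layers. The function name `assign_layers` is hypothetical, and the sketch assumes N is a multiple of M.

```python
from typing import List


def assign_layers(strategy: str, m: int, n: int) -> List[int]:
    """Return, for each of n layers, the 0-based index of its parameter set.

    Hypothetical helper; assumes n is a positive multiple of m.
    """
    if m < 1 or n < m or n % m != 0:
        raise ValueError("this sketch assumes n is a positive multiple of m")

    if strategy == "sequence":
        # Each parameter set is used by n // m consecutive layers.
        run = n // m
        return [i // run for i in range(n)]

    if strategy == "cycle":
        # The m parameter sets are stacked in order, then repeated.
        return [i % m for i in range(n)]

    if strategy == "cycle_rev":
        # Same as cycle, except the final m layers are stacked in reverse.
        cutoff = m * (n // m - 1)
        result = []
        for i in range(n):
            if i < m:
                result.append(i)            # first pass: fresh parameter sets
            elif i < cutoff:
                result.append(i % m)        # middle passes: plain cycle
            else:
                result.append(m - 1 - (i % m))  # last pass: reversed
        return result

    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    for name in ("sequence", "cycle", "cycle_rev"):
        print(name, assign_layers(name, m=3, n=6))
```

With M = 1, all three strategies reduce to sharing a single layer's parameters across every layer, which is the technique the abstract says is being relaxed. With M = N, no parameters are shared.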

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications