Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers
Machel Reid, Edison Marrese-Taylor, Yutaka Matsuo

TL;DR
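Subformer introduces a parameter-sharing approach for generative Transformers that combines sandwich-style parameter sharing with self-attentive embedding factorization (SAFE), yielding models that can outperform standard Transformers on machine translation, abstractive summarization, and language modeling while using significantly fewer parameters.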
Contribution
The paper proposes the Subformer, a new parameter-sharing method that enhances efficiency and performance in generative Transformer models.
Findings
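Subformer can outperform standard Transformers while using significantly fewer parameters.
Sandwich-style parameter sharing overcomes the limitations of naive cross-layer parameter sharing in generative models.
SAFE (self-attentive embedding factorization) makes the embedding layer more parameter-efficient.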
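To make the embedding-efficiency finding concrete, here is a back-of-the-envelope sketch comparing parameter counts for a full embedding table and a factorized one. It assumes that SAFE amounts to a small embedding table of width `d_embed`, one self-attention layer at that width, and a linear projection up to the model width `d_model`. The function names, the sizes, and the decision to ignore biases and layer normalization are all illustrative assumptions, not figures or code from the paper.

```python
def full_embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in a standard embedding table of shape (vocab_size, d_model)."""
    return vocab_size * d_model


def factorized_embedding_params(vocab_size: int, d_embed: int, d_model: int) -> int:
    """Rough parameter count for a self-attentive factorized embedding.

    Assumed layout (hypothetical, for illustration only):
      1. a small embedding table of shape (vocab_size, d_embed)
      2. one self-attention layer at width d_embed
         (query, key, value, and output projections, biases ignored)
      3. a linear projection from d_embed up to d_model
    """
    table = vocab_size * d_embed
    attention = 4 * d_embed * d_embed
    up_projection = d_embed * d_model
    return table + attention + up_projection


# Illustrative sizes, not taken from the paper.
vocab_size, d_embed, d_model = 32_000, 128, 512

full = full_embedding_params(vocab_size, d_model)
factorized = factorized_embedding_params(vocab_size, d_embed, d_model)

print(f"full table:  {full:,}")        # 16,384,000
print(f"factorized:  {factorized:,}")  # 4,227,072
```

Under these assumed sizes the vocabulary table dominates the count, so shrinking its width saves far more parameters than the added attention layer and projection cost.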
Abstract
Transformers have shown improved performance compared to previous architectures for sequence processing such as RNNs. Despite these sizeable performance gains, recent work suggests that the model is computationally expensive to train and carries a high parameter budget. In light of this, we explore parameter-sharing methods in Transformers with a specific focus on generative models. We perform an analysis of different parameter sharing/reduction methods and develop the Subformer. Our model combines sandwich-style parameter sharing, which overcomes the limitations of naive cross-layer parameter sharing in generative models, and self-attentive embedding factorization (SAFE). Experiments on machine translation, abstractive summarization and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters.
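The sandwich-style sharing mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "sandwich-style" means the first and last layers keep their own parameters while every layer in between reuses a single shared set. The `Layer` class is a hypothetical stand-in that only records a parameter count, and `build_sandwich_stack` and `count_unique_params` are illustrative helpers, not code from the paper.

```python
class Layer:
    """Hypothetical stand-in for a Transformer layer; stores only a parameter count."""

    def __init__(self, name: str, n_params: int) -> None:
        self.name = name
        self.n_params = n_params


def build_sandwich_stack(n_layers: int, params_per_layer: int) -> list:
    """Build a layer stack where only the middle layers share parameters.

    Assumption: the first and last layers are independent, and all
    layers between them are the same object (one shared parameter set).
    """
    if n_layers < 3:
        raise ValueError("need at least 3 layers to have a shared middle")
    first = Layer("first", params_per_layer)
    shared = Layer("shared_middle", params_per_layer)
    last = Layer("last", params_per_layer)
    # Repeating the same object models parameter sharing across depth.
    return [first] + [shared] * (n_layers - 2) + [last]


def count_unique_params(stack: list) -> int:
    """Count parameters once per distinct layer object."""
    unique = {id(layer): layer for layer in stack}
    return sum(layer.n_params for layer in unique.values())


# Illustrative sizes, not taken from the paper.
n_layers, params_per_layer = 6, 1_000_000
stack = build_sandwich_stack(n_layers, params_per_layer)

unshared_total = n_layers * params_per_layer
sandwich_total = count_unique_params(stack)

print(f"unshared: {unshared_total:,}")  # 6,000,000
print(f"sandwich: {sandwich_total:,}")  # 3,000,000
```

In this sketch the depth of the stack is unchanged; only the number of distinct parameter sets drops from six to three. Naive cross-layer sharing would correspond to reusing one object for every position, including the first and last.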
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Subformer · Softmax · Dropout · Byte Pair Encoding · Dense Connections · Label Smoothing · Multi-Head Attention
