Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers

Machel Reid; Edison Marrese-Taylor; Yutaka Matsuo

arXiv:2101.00234·cs.CL·September 9, 2021

TL;DR

Subformer is a parameter-sharing approach for generative Transformers that combines sandwich-style parameter sharing with self-attentive embedding factorization (SAFE). The resulting models can outperform the standard Transformer on machine translation, abstractive summarization, and language modeling while using significantly fewer parameters.

Contribution

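The paper analyzes parameter sharing and reduction methods for Transformers and proposes the Subformer, a generative Transformer that combines sandwich-style parameter sharing with SAFE to reduce the parameter count without giving up performance.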

Findings

01. Subformer outperforms standard Transformers with fewer parameters.
02. Sandwich-style parameter sharing overcomes naive sharing limitations.
03. SAFE improves embedding efficiency in the model.
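The difference between naive and sandwich-style sharing can be pictured as a mapping from layer positions to parameter sets. The sketch below is illustrative only: it assumes (this page does not spell it out) that the first and last layers keep their own parameters while all layers in between reuse one shared set. The function names and the per-layer parameter formula are hypothetical and are not taken from the official repository.

```python
def unshared_assignment(num_layers):
    """Standard Transformer: every layer owns its own parameter set."""
    return list(range(num_layers))


def naive_assignment(num_layers):
    """Naive cross-layer sharing: every layer reuses parameter set 0."""
    return [0] * num_layers


def sandwich_assignment(num_layers):
    """Sandwich-style sharing (assumed layout, see lead-in).

    First and last layers are independent; the middle layers share one set.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if num_layers <= 2:
        # Nothing sits between the outer layers, so nothing is shared.
        return list(range(num_layers))
    return [0] + [1] * (num_layers - 2) + [2]


def layer_params(d_model, d_ff):
    """Rough parameter count for one encoder-style layer (with biases)."""
    attention = 4 * (d_model * d_model + d_model)  # Q, K, V, output projections
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)  # two norms, each with scale and shift
    return attention + feed_forward + layer_norms


def stack_params(assignment, per_layer):
    """Total parameters in a stack: one copy per distinct parameter set."""
    return len(set(assignment)) * per_layer


if __name__ == "__main__":
    per_layer = layer_params(d_model=512, d_ff=2048)
    for name, fn in [("unshared", unshared_assignment),
                     ("naive", naive_assignment),
                     ("sandwich", sandwich_assignment)]:
        print(name, fn(6), stack_params(fn(6), per_layer))
```

Under these assumptions a 6-layer stack stores three parameter sets with sandwich sharing, compared with six when unshared and one with naive sharing. The compute per forward pass is unchanged, since every layer still runs.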

Abstract

Transformers have shown improved performance when compared to previous architectures for sequence processing such as RNNs. Despite these sizeable performance gains, recent work suggests that the model is computationally expensive to train and has a high parameter budget. In light of this, we explore parameter-sharing methods in Transformers with a specific focus on generative models. We perform an analysis of different parameter sharing/reduction methods and develop the Subformer. Our model combines sandwich-style parameter sharing, which overcomes the limitations of naive cross-layer parameter sharing in generative models, and self-attentive embedding factorization (SAFE). Experiments on machine translation, abstractive summarization and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters.
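To see why factorizing the embedding saves parameters, a back-of-the-envelope count helps. The sketch below assumes a SAFE-like layout that the abstract does not detail: a small embedding table of width `d_embed`, one bias-free self-attention layer at that width, and a linear projection up to `d_model`. The function names and the example sizes are hypothetical.

```python
def full_embedding_params(vocab_size, d_model):
    """Standard embedding table: one d_model-sized vector per token."""
    return vocab_size * d_model


def factorized_embedding_params(vocab_size, d_embed, d_model):
    """Parameter count for a factorized embedding (assumed layout).

    Assumption: small table, then self-attention at width d_embed
    (Q, K, V, output projections, no biases), then a projection to d_model.
    """
    table = vocab_size * d_embed
    attention = 4 * d_embed * d_embed
    up_projection = d_embed * d_model
    return table + attention + up_projection


if __name__ == "__main__":
    full = full_embedding_params(32000, 512)
    small = factorized_embedding_params(32000, 128, 512)
    print(full, small, round(small / full, 3))
```

The table term dominates in both cases, so the saving is roughly the ratio `d_embed / d_model` whenever the vocabulary is much larger than either width.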

Code & Models

Repositories

machelreid/subformer (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Subformer · Softmax · Dropout · Byte Pair Encoding · Dense Connections · Label Smoothing · Multi-Head Attention