Rethinking the Value of Transformer Components

Wenxuan Wang; Zhaopeng Tu

arXiv:2011.03803·cs.CL·November 10, 2020·1 cites

TL;DR

This paper evaluates the importance of individual components (sub-layers) in trained Transformer translation models, finds that certain components are consistently more important than others, and proposes a training strategy that improves translation performance by distinguishing the unimportant components during training.

Contribution

It systematically analyzes component contributions in trained Transformers and introduces a novel training method that emphasizes important components for better translation results.

Findings

01 Certain components are consistently more important across models.

02 Some components have minimal impact on performance.

03 A new training strategy improves translation quality by focusing on key components.

Abstract

The Transformer has become the state-of-the-art translation model, yet it is not well studied how each intermediate component contributes to model performance, which poses significant challenges for designing optimal architectures. In this work, we bridge this gap by evaluating the impact of individual components (sub-layers) in trained Transformer models from different perspectives. Experimental results across language pairs, training strategies, and model capacities show that certain components are consistently more important than others. We also report a number of interesting findings that might help humans better analyze, understand, and improve Transformer models. Based on these observations, we further propose a new training strategy that can improve translation performance by distinguishing the unimportant components in training.
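The abstract does not say how the impact of a component is measured. As a rough illustration only, the sketch below assumes one plausible approach: zero out a single sub-layer's output in a trained residual model (so only the residual path remains) and record how much the error grows. The sub-layer names, the toy functions, and the use of mean squared error in place of a translation-quality metric are all hypothetical and not taken from the paper.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
Component = Callable[[Vector], Vector]


def run_model(components: Dict[str, Component], x: Vector,
              masked: Optional[str] = None) -> Vector:
    """Apply each sub-layer with a residual connection.

    A masked sub-layer contributes nothing, so its input passes
    through unchanged on the residual path.
    """
    h = list(x)
    for name, f in components.items():
        if name == masked:
            continue  # ablated: keep only the residual path
        update = f(h)
        h = [a + b for a, b in zip(h, update)]
    return h


def mse(a: Vector, b: Vector) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def component_importance(components: Dict[str, Component],
                         inputs: List[Vector],
                         targets: List[Vector]) -> Dict[str, float]:
    """Error increase caused by masking each sub-layer in turn."""
    def mean_error(masked: Optional[str]) -> float:
        errors = [mse(run_model(components, x, masked), t)
                  for x, t in zip(inputs, targets)]
        return sum(errors) / len(errors)

    baseline = mean_error(None)
    return {name: mean_error(name) - baseline for name in components}


# Hypothetical toy sub-layers standing in for trained Transformer components.
toy_components: Dict[str, Component] = {
    "enc_self_attn": lambda h: [0.5 * v for v in h],
    "enc_ffn": lambda h: [0.01 * v for v in h],
    "dec_cross_attn": lambda h: [1.0 * v for v in h],
}

inputs = [[1.0, 2.0], [0.5, -1.0]]
# Targets are the unmasked model's own outputs, so the baseline error is zero.
targets = [run_model(toy_components, x) for x in inputs]

scores = component_importance(toy_components, inputs, targets)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy setup the sub-layer with the near-zero update ranks last, mirroring the kind of observation the abstract describes: some components can be removed with little effect while others cannot. A real study would replace the error function with a translation metric on held-out data.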

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Adam · Attention Is All You Need · Byte Pair Encoding · Dropout · Softmax · Multi-Head Attention