Incorporating Residual and Normalization Layers into Analysis of Masked Language Models

Goro Kobayashi; Tatsuki Kuribayashi; Sho Yokoi; Kentaro Inui

arXiv:2109.07152·cs.CL·September 16, 2021

TL;DR

This paper broadens the analysis of Transformer-based masked language models by including residual and normalization layers, revealing that attention patterns are less critical to intermediate representations than previously thought.

Contribution

It extends the analysis of Transformers beyond attention patterns to include residual and normalization layers, offering new insights into their roles.

Findings

01. Attention patterns have less impact on intermediate representations than previously assumed.

02. Discarding learned attention patterns tends not to harm model performance.

03. Residual connections and layer normalization shape Transformer behavior alongside attention.

Abstract

The Transformer architecture has become ubiquitous in the natural language processing field. To interpret Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not composed only of multi-head attention; other components can also contribute to Transformers' progressive performance. In this study, we extended the scope of the analysis of Transformers from solely the attention patterns to the whole attention block, i.e., multi-head attention, residual connection, and layer normalization. Our analysis of Transformer-based masked language models shows that the token-to-token interaction performed via attention has less impact on the intermediate representations than previously assumed. These results provide new intuitive explanations of existing reports; for example, discarding the learned attention patterns tends not…
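To make the "whole attention block" concrete, here is a minimal NumPy sketch of a single-head, post-layer-norm attention block that splits the pre-normalization sum into a part coming from other tokens ("mixing") and a part coming from the token itself ("preserving"). It is an illustration under stated assumptions, not the authors' implementation: the function names, the single head, the omission of learned gain and bias in layer normalization, and the ratio `mixing_ratio` are all hypothetical choices made for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (assumption: no learned gain or bias, to keep the sketch short).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_block(x, w_q, w_k, w_v, w_o):
    """Single-head attention block: LN(x + Attn(x)).

    Returns the block output plus two terms whose sum is the
    pre-normalization vector x + Attn(x):
      mixing     - contribution from the other tokens (j != i)
      preserving - residual input plus the token's own attention term
    """
    d = w_q.shape[1]
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    alpha = softmax(q @ k.T / np.sqrt(d))   # (n, n) attention weights
    f = v @ w_o                             # transformed value vector per token
    attn_out = alpha @ f                    # sum_j alpha[i, j] * f[j]
    own = np.diag(alpha)[:, None] * f       # alpha[i, i] * f[i]
    mixing = attn_out - own
    preserving = x + own
    out = layer_norm(x + attn_out)
    return out, mixing, preserving

def mixing_ratio(mixing, preserving):
    # Hypothetical summary statistic: share of vector norm that comes
    # from other tokens, computed per token and bounded in [0, 1].
    m = np.linalg.norm(mixing, axis=-1)
    p = np.linalg.norm(preserving, axis=-1)
    return m / (m + p)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))                           # 4 tokens, width 8
    weights = [rng.normal(size=(8, 8)) * 0.1 for _ in range(4)]
    out, mixing, preserving = attention_block(x, *weights)
    print(mixing_ratio(mixing, preserving))
```

Comparing vector norms rather than attention weights alone is the point of the sketch: a token can receive large attention weights from its neighbors while the residual term still dominates the vector that reaches layer normalization.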

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Label Smoothing · Adam · Residual Connection · Multi-Head Attention · Softmax