How to Dissect a Muppet: The Structure of Transformer Embedding Spaces

Timothee Mickus; Denis Paperno; Mathieu Constant

arXiv:2206.03529·cs.CL·June 9, 2022

TL;DR

This paper introduces a mathematical framework that rewrites Transformer embeddings as a sum of vector factors. It uses that rewriting to measure how each model component shapes the embedding space and affects downstream tasks, and to quantify the effects of finetuning.

Contribution

It presents a novel mathematical decomposition of Transformer embeddings, enabling detailed analysis of component roles and their influence on downstream applications.

Findings

01. Multi-head attention and feed-forward components are not equally useful across downstream tasks.

02. Finetuning significantly alters the embedding space.

03. The decomposition connects embedding structure to previous vector space studies, from anisotropy to attention weights.

Abstract

Pretrained embeddings based on the Transformer architecture have taken the NLP community by storm. We show that they can mathematically be reframed as a sum of vector factors and showcase how to use this reframing to study the impact of each component. We provide evidence that multi-head attentions and feed-forwards are not equally useful in all downstream applications, as well as a quantitative overview of the effects of finetuning on the overall embedding space. This approach allows us to draw connections to a wide range of previous studies, from vector space anisotropy to attention weights.
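
To make the "sum of vector factors" idea concrete, here is a minimal toy sketch rather than the paper's actual derivation, which this page does not reproduce. It assumes a simplified residual network with no layer normalization and no bias terms, and the names `toy_attention`, `toy_feed_forward`, and `forward_with_terms` are hypothetical stand-ins. Because each sublayer only adds its output to the running hidden state, the final embedding equals the input plus the accumulated attention outputs plus the accumulated feed-forward outputs.

```python
import math
import random

random.seed(0)
DIM = 4  # toy embedding width (assumption, chosen for readability)


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def rand_matrix(n):
    """Random n x n weight matrix with entries in [-1, 1]."""
    return [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def toy_attention(x, w):
    # Hypothetical stand-in for a multi-head attention sublayer: one linear map.
    return matvec(w, x)


def toy_feed_forward(x, w):
    # Hypothetical stand-in for a feed-forward sublayer: linear map, then tanh.
    return [math.tanh(y) for y in matvec(w, x)]


def forward_with_terms(x, layers):
    """Run residual layers and accumulate each sublayer type's contribution."""
    terms = {
        "input": list(x),
        "attention": [0.0] * DIM,
        "feed_forward": [0.0] * DIM,
    }
    hidden = list(x)
    for w_att, w_ff in layers:
        att_out = toy_attention(hidden, w_att)
        hidden = add(hidden, att_out)  # residual connection
        terms["attention"] = add(terms["attention"], att_out)

        ff_out = toy_feed_forward(hidden, w_ff)
        hidden = add(hidden, ff_out)  # residual connection
        terms["feed_forward"] = add(terms["feed_forward"], ff_out)
    return hidden, terms


layers = [(rand_matrix(DIM), rand_matrix(DIM)) for _ in range(2)]
x = [random.uniform(-1, 1) for _ in range(DIM)]

embedding, terms = forward_with_terms(x, layers)
reconstructed = add(add(terms["input"], terms["attention"]), terms["feed_forward"])

# The per-component terms sum back to the final embedding (up to rounding).
max_error = max(abs(a - b) for a, b in zip(embedding, reconstructed))
print("max reconstruction error:", max_error)
```

Once the terms are separated this way, each one can be inspected or ablated on its own, which is the kind of per-component analysis the abstract describes. A real Transformer also applies layer normalization and bias terms, which this sketch deliberately leaves out.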

Taxonomy

Topics: Topic Modeling · Computational Physics and Python Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Adam · Label Smoothing · Softmax · Byte Pair Encoding · Dropout · Residual Connection