How to Dissect a Muppet: The Structure of Transformer Embedding Spaces
Timothee Mickus, Denis Paperno, Mathieu Constant

TL;DR
This paper introduces a mathematical framework for analyzing Transformer embeddings. The framework shows how each architectural component contributes to the structure of the embedding space and to downstream task performance, and it quantifies the effects of finetuning.
Contribution
It presents a novel mathematical decomposition of Transformer embeddings, enabling detailed analysis of component roles and their influence on downstream applications.
Findings
Multi-head attention and feed-forward components vary in usefulness across tasks.
Finetuning significantly alters the embedding space.
The decomposition connects embedding structure to previous vector space studies, from anisotropy to attention weights.
Abstract
Pretrained embeddings based on the Transformer architecture have taken the NLP community by storm. We show that they can mathematically be reframed as a sum of vector factors and showcase how to use this reframing to study the impact of each component. We provide evidence that multi-head attentions and feed-forwards are not equally useful in all downstream applications, as well as a quantitative overview of the effects of finetuning on the overall embedding space. This approach allows us to draw connections to a wide range of previous studies, from vector space anisotropy to attention weights.
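The idea of reading an embedding as a sum of vector terms can be illustrated with a toy example. The sketch below is not the paper's decomposition: it assumes a simplified stack with single-head attention, no layer normalization, and no biases, and every name in it (run_toy_stack, terms, the layer dictionary layout) is hypothetical. Under those assumptions, residual connections make the final hidden state equal to the input plus the accumulated attention and feed-forward outputs.

```python
import numpy as np


def softmax(scores):
    # Subtract the row maximum for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def attention(x, w_q, w_k, w_v, w_o):
    # Single-head scaled dot-product self-attention (assumption: one head, no bias).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v @ w_o


def feed_forward(x, w_in, w_out):
    # Two-layer ReLU feed-forward sublayer (assumption: no bias).
    return np.maximum(x @ w_in, 0.0) @ w_out


def run_toy_stack(x, layers):
    """Run a residual stack and accumulate what each component type adds."""
    terms = {
        "input": x.copy(),
        "attention": np.zeros_like(x),
        "feed_forward": np.zeros_like(x),
    }
    hidden = x
    for layer in layers:
        attn_out = attention(hidden, *layer["attn"])
        hidden = hidden + attn_out          # residual connection
        terms["attention"] += attn_out
        ff_out = feed_forward(hidden, *layer["ff"])
        hidden = hidden + ff_out            # residual connection
        terms["feed_forward"] += ff_out
    return hidden, terms


# Toy dimensions and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff, n_layers = 4, 8, 16, 2
layers = [
    {
        "attn": [rng.normal(scale=0.3, size=(d_model, d_model)) for _ in range(4)],
        "ff": [
            rng.normal(scale=0.3, size=(d_model, d_ff)),
            rng.normal(scale=0.3, size=(d_ff, d_model)),
        ],
    }
    for _ in range(n_layers)
]
x = rng.normal(size=(seq_len, d_model))

hidden, terms = run_toy_stack(x, layers)
reconstructed = terms["input"] + terms["attention"] + terms["feed_forward"]

# Per-token norm of each term, as a rough view of how much each one contributes.
term_norms = {name: np.linalg.norm(value, axis=-1) for name, value in terms.items()}
```

Comparing the entries of term_norms gives a rough sense of how much each component type contributes to a given token's embedding. Once layer normalization and biases are included, the sum no longer holds in this simple form and would need additional scaling and correction terms.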
Taxonomy
Topics: Topic Modeling · Computational Physics and Python Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Adam · Label Smoothing · Softmax · Byte Pair Encoding · Dropout · Residual Connection
