Abrupt Learning in Transformers: A Case Study on Matrix Completion
Pulkit Gopalani, Ekdeep Singh Lubana, Wei Hu

TL;DR
This paper investigates abrupt loss drops in Transformer training through a matrix completion task. It finds that the drop marks a transition from copying the masked input to accurate prediction, accompanied by interpretable attention patterns and embeddings.
Contribution
It demonstrates the phenomenon of sudden loss drops in Transformers and provides interpretability insights into the model's internal transition during training.
Findings
Loss curves show a plateau followed by a sharp drop.
Attention heads develop interpretable patterns after the drop.
Embeddings encode task-relevant information.
Abstract
Recent analysis of the training dynamics of Transformers has unveiled an interesting characteristic: the training loss plateaus for a significant number of training steps, and then suddenly (and sharply) drops to near-optimal values. To understand this phenomenon in depth, we formulate the low-rank matrix completion problem as a masked language modeling (MLM) task, and show that it is possible to train a BERT model to solve this task to low error. Furthermore, the loss curve shows a plateau early in training followed by a sudden drop to near-optimal values, despite no changes in the training procedure or hyper-parameters. To gain interpretability insights into this sudden drop, we examine the model's predictions, attention heads, and hidden states before and after this transition. Concretely, we observe that (a) the model transitions from simply copying the masked input to accurately…
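The MLM framing described in the abstract can be illustrated with a minimal data-generation sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the matrix size, the rank, the masking probability, the uniform sampling of the factors, the use of 0.0 as the mask value, and the helper names `make_masked_example` and `masked_mse`. The sketch builds a low-rank matrix, hides a random subset of entries, flattens it into a sequence, and scores a "copy the masked input" baseline on the hidden positions only.

```python
import numpy as np


def make_masked_example(n=7, rank=2, mask_prob=0.3, mask_value=0.0, seed=0):
    """Build one hypothetical training example for matrix completion as MLM.

    All default values are illustrative assumptions, not the paper's settings.
    Returns the flattened masked input, the flattened target, and the mask.
    """
    rng = np.random.default_rng(seed)
    # A product of two n x rank factors has rank at most `rank`.
    U = rng.uniform(-1.0, 1.0, size=(n, rank))
    V = rng.uniform(-1.0, 1.0, size=(n, rank))
    X = U @ V.T
    # True marks an entry hidden from the model.
    mask = rng.random((n, n)) < mask_prob
    X_masked = np.where(mask, mask_value, X)
    # Flatten row by row so each matrix entry becomes one sequence position.
    return X_masked.reshape(-1), X.reshape(-1), mask.reshape(-1)


def masked_mse(pred, target, mask):
    """Mean squared error over the masked positions only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))


inputs, targets, mask = make_masked_example()

# "Copying" baseline: output the masked input unchanged, so every hidden
# entry is predicted as the mask value.
copy_loss = masked_mse(inputs, targets, mask)
print(f"masked positions: {int(mask.sum())} of {mask.size}")
print(f"copy-baseline masked MSE: {copy_loss:.4f}")
```

A trained model would replace the copy baseline with its own predictions at the masked positions. How real-valued entries are tokenized or embedded for a BERT model is not covered by this sketch.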
Taxonomy
Topics: Computability, Logic, AI Algorithms · Neural Networks and Applications
Methods: Attention Is All You Need · Adam · Linear Layer · Attention Dropout · Dropout · Weight Decay · Dense Connections · Layer Normalization · Residual Connection
