Abrupt Learning in Transformers: A Case Study on Matrix Completion

Pulkit Gopalani; Ekdeep Singh Lubana; Wei Hu

arXiv:2410.22244 · cs.LG · October 30, 2024

TL;DR

This paper investigates abrupt loss drops during Transformer training by posing low-rank matrix completion as a masked language modeling task for a BERT model. Around the drop, the model shifts from copying the masked input to accurate prediction, and its attention heads and embeddings become interpretable.

Contribution

It reproduces the plateau-then-drop loss curve in a controlled setting (a BERT model trained on low-rank matrix completion) and provides interpretability insights into the model's internal transition during training.

Findings

01 Loss curves show a plateau followed by a sharp drop.

02 Attention heads develop interpretable patterns after the drop.

03 Embeddings encode task-relevant information.
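
As a rough illustration of the first finding, the sketch below locates the sharpest drop in a recorded loss curve by comparing the mean loss in adjacent windows. The helper `sharpest_drop`, its `window` parameter, and the synthetic curve are hypothetical and are not taken from the paper; they only show one simple way such a plateau-then-drop shape could be detected.

```python
def sharpest_drop(losses, window=5):
    """Return (step, drop) where the mean loss over the `window` steps
    starting at `step` is furthest below the mean over the `window`
    steps just before it. Hypothetical helper, not from the paper."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    best_step, best_drop = None, 0.0
    for t in range(window, len(losses) - window + 1):
        before = sum(losses[t - window:t]) / window
        after = sum(losses[t:t + window]) / window
        drop = before - after
        if drop > best_drop:
            best_step, best_drop = t, drop
    return best_step, best_drop


# Synthetic curve (an assumption for illustration): a plateau at 1.0
# for 20 steps, then an abrupt fall to 0.1.
curve = [1.0] * 20 + [0.1] * 20
step, drop = sharpest_drop(curve, window=5)
```

On real training logs the window size would need tuning, since noisy curves can produce spurious local drops.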

Abstract

Recent analysis on the training dynamics of Transformers has unveiled an interesting characteristic: the training loss plateaus for a significant number of training steps, and then suddenly (and sharply) drops to near-optimal values. To understand this phenomenon in depth, we formulate the low-rank matrix completion problem as a masked language modeling (MLM) task, and show that it is possible to train a BERT model to solve this task to low error. Furthermore, the loss curve shows a plateau early in training followed by a sudden drop to near-optimal values, despite no changes in the training procedure or hyper-parameters. To gain interpretability insights into this sudden drop, we examine the model's predictions, attention heads, and hidden states before and after this transition. Concretely, we observe that (a) the model transitions from simply copying the masked input to accurately…
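
To make the setup in the abstract concrete, here is a minimal sketch of how a low-rank matrix completion instance could be turned into an MLM-style example: build a low-rank matrix, flatten it into a sequence, and replace a random subset of entries with a mask token. The matrix size, rank, entry distribution, masking probability, the `MASK` token, and both function names are assumptions for illustration; the paper's actual data pipeline and tokenization may differ.

```python
import random

MASK = "[MASK]"  # hypothetical mask token; the paper's tokenization may differ


def make_low_rank_matrix(n, rank, rng):
    """Return an n x n matrix X = U V^T, with U and V of shape n x rank."""
    u = [[rng.uniform(-1.0, 1.0) for _ in range(rank)] for _ in range(n)]
    v = [[rng.uniform(-1.0, 1.0) for _ in range(rank)] for _ in range(n)]
    return [[sum(u[i][k] * v[j][k] for k in range(rank)) for j in range(n)]
            for i in range(n)]


def to_mlm_example(matrix, mask_prob, rng):
    """Flatten the matrix row by row and mask each entry independently.

    Returns (inputs, targets, masked_positions): `inputs` holds MASK at
    masked positions and the true value elsewhere; `targets` holds every
    true value.
    """
    targets = [x for row in matrix for x in row]
    inputs, masked_positions = [], []
    for pos, value in enumerate(targets):
        if rng.random() < mask_prob:
            inputs.append(MASK)
            masked_positions.append(pos)
        else:
            inputs.append(value)
    return inputs, targets, masked_positions


rng = random.Random(0)  # fixed seed so the example is reproducible
X = make_low_rank_matrix(n=7, rank=2, rng=rng)  # sizes are illustrative
inputs, targets, masked_positions = to_mlm_example(X, mask_prob=0.3, rng=rng)
```

A model trained on such examples would be scored on how well it recovers `targets` at `masked_positions` from the observed entries in `inputs`.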

Videos

Abrupt Learning in Transformers: A Case Study on Matrix Completion · SlidesLive

Taxonomy

Topics: Computability, Logic, AI Algorithms · Neural Networks and Applications

Methods: Attention Is All You Need · Adam · Linear Layer · Attention Dropout · Dropout · Weight Decay · Dense Connections · Layer Normalization · Residual Connection