BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance
R. Thomas McCoy, Junghyun Min, Tal Linzen

TL;DR
This study shows that multiple instances of the same neural network architecture, fine-tuned on the same data, can generalize very differently even when they perform almost identically on the standard in-distribution evaluation set.
Contribution
It demonstrates large variability in out-of-distribution generalization across models with nearly identical in-distribution accuracy, tentatively attributes that variability to many equally attractive local minima, and points to the need for stronger inductive biases.
Findings
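All 100 fine-tuned BERT instances reach consistent accuracy on the MNLI development set (83.6% to 84.8%).
The same instances vary widely in syntactic generalization on HANS (0.00% to 66.2% on the subject-object swap case).
The variability is likely due to many equally attractive local minima in the training landscape.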
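To make the contrast between the two spreads concrete, here is a minimal sketch that computes the range of accuracies across runs. The function name `summarize_spread` is hypothetical, and the inputs are only the minimum and maximum values the abstract reports, not the full per-run results, which are not reproduced here.

```python
def summarize_spread(accuracies):
    """Return (min, max, range) in percentage points for per-run accuracies."""
    lo, hi = min(accuracies), max(accuracies)
    # Round to one decimal place to avoid floating-point noise in the difference.
    return lo, hi, round(hi - lo, 1)


# Assumption: only the reported extremes are used as inputs, so the output
# describes the range across runs and nothing about the distribution in between.
mnli_dev_extremes = [83.6, 84.8]
hans_swap_extremes = [0.0, 66.2]

print("MNLI dev:", summarize_spread(mnli_dev_extremes))
print("HANS subject-object swap:", summarize_spread(hans_swap_extremes))
```

With the full list of per-run accuracies, the same function could be applied unchanged; a standard deviation or histogram would then say more than the range alone.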
Abstract
If the same neural network architecture is trained multiple times on the same dataset, will it make similar linguistic generalizations across runs? To study this question, we fine-tuned 100 instances of BERT on the Multi-genre Natural Language Inference (MNLI) dataset and evaluated them on the HANS dataset, which evaluates syntactic generalization in natural language inference. On the MNLI development set, the behavior of all instances was remarkably consistent, with accuracy ranging between 83.6% and 84.8%. In stark contrast, the same models varied widely in their generalization performance. For example, on the simple case of subject-object swap (e.g., determining that "the doctor visited the lawyer" does not entail "the lawyer visited the doctor"), accuracy ranged from 0.00% to 66.2%. Such variation is likely due to the presence of many local minima that are equally attractive to a…
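The subject-object swap case from the abstract can be sketched as follows. This is an illustrative construction under stated assumptions, not the HANS generation code: the helper names `make_swap_example` and `accuracy` and the simple sentence template are hypothetical, and the predictor is a stand-in that always answers "entailment" rather than a fine-tuned BERT.

```python
def make_swap_example(subject, verb, obj):
    """Build one swap pair: the hypothesis reverses subject and object."""
    premise = f"the {subject} {verb} the {obj}"
    hypothesis = f"the {obj} {verb} the {subject}"
    # Swapping the two roles changes who did what, so the premise does not
    # entail the hypothesis.
    return {"premise": premise, "hypothesis": hypothesis, "label": "non-entailment"}


def accuracy(predictions, examples):
    """Percentage of predictions that match the gold labels."""
    correct = sum(pred == ex["label"] for pred, ex in zip(predictions, examples))
    return 100.0 * correct / len(examples)


examples = [
    make_swap_example("doctor", "visited", "lawyer"),
    make_swap_example("judge", "called", "author"),  # hypothetical extra item
]

# Stand-in predictor (assumption): always answers "entailment", as a model
# relying purely on word overlap between premise and hypothesis might.
always_entail = ["entailment"] * len(examples)
print(accuracy(always_entail, examples))
```

Because every gold label in this case is "non-entailment", a predictor that always answers "entailment" gets none of the items right, which illustrates how an accuracy at the bottom of the reported range can arise.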
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece · Softmax
