Frequency Effects on Syntactic Rule Learning in Transformers
Jason Wei, Dan Garrette, Tal Linzen, and Ellie Pavlick

TL;DR
This study examines how BERT learns syntactic rules like subject-verb agreement, revealing that while it can generalize to unseen pairs, its performance is heavily affected by word frequency and training biases.
Contribution
The paper introduces controlled interventions at pre-training to analyze BERT's rule learning, highlighting the influence of frequency effects on syntactic generalization.
Findings
BERT generalizes well to unseen subject-verb pairs.
Word frequency significantly impacts BERT's agreement predictions.
BERT struggles with infrequent lexical items due to training priors.
Abstract
Pre-trained language models perform well on a variety of linguistic tasks that require symbolic reasoning, raising the question of whether such models implicitly represent abstract symbols and rules. We investigate this question using the case study of BERT's performance on English subject-verb agreement. Unlike prior work, we train multiple instances of BERT from scratch, allowing us to perform a series of controlled interventions at pre-training time. We show that BERT often generalizes well to subject-verb pairs that never occurred in training, suggesting a degree of rule-governed behavior. We also find, however, that performance is heavily influenced by word frequency, with experiments showing that both the absolute frequency of a verb form, as well as the frequency relative to the alternate inflection, are causally implicated in the predictions BERT makes at inference time. Closer…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
MethodsAttention Is All You Need · Linear Layer · Layer Normalization · Linear Warmup With Linear Decay · Weight Decay · Refunds@Expedia|||How do I get a full refund from Expedia? · Adam · Residual Connection · Multi-Head Attention · Softmax
