Lexical Generalization Improves with Larger Models and Longer Training
Elron Bandel, Yoav Goldberg, Yanai Elazar

TL;DR
This paper finds that larger models and longer training reduce reliance on superficial lexical overlap heuristics in natural language inference, paraphrase detection, and reading comprehension, and it provides evidence that the gap between model sizes originates in the pre-trained models.
Contribution
The paper shows that increasing model size and training length reduces reliance on lexical overlap heuristics, highlighting the role of model scale and training duration in robustness.
Findings
Larger models are less susceptible to lexical overlap heuristics.
Longer training reduces reliance on superficial heuristics.
Disparity between model sizes originates from pre-trained models.
Abstract
While fine-tuned language models perform well on many tasks, they were also shown to rely on superficial surface features such as lexical overlap. Excessive utilization of such heuristics can lead to failure on challenging inputs. We analyze the use of lexical overlap heuristics in natural language inference, paraphrase detection, and reading comprehension (using a novel contrastive dataset), and find that larger models are much less susceptible to adopting lexical overlap heuristics. We also find that longer training leads models to abandon lexical overlap heuristics. Finally, we provide evidence that the disparity between model sizes has its source in the pre-trained model.
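To make the "lexical overlap heuristic" concrete, the following is a minimal sketch of how one might score word overlap between a premise and a hypothesis, and how a shortcut that predicts entailment from high overlap fails on a challenging input. The function names, the tokenizer, the 0.9 threshold, and the example sentence pair are illustrative assumptions, not the measure or data used in the paper.

```python
import re


def tokenize(text):
    # Hypothetical tokenizer: lowercase, then keep runs of letters, digits, apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def lexical_overlap(premise, hypothesis):
    # Fraction of hypothesis tokens that also occur in the premise.
    # This is one common way to define overlap, assumed here for illustration.
    premise_tokens = set(tokenize(premise))
    hypothesis_tokens = tokenize(hypothesis)
    if not hypothesis_tokens:
        return 0.0
    shared = sum(1 for tok in hypothesis_tokens if tok in premise_tokens)
    return shared / len(hypothesis_tokens)


def overlap_heuristic_predicts_entailment(premise, hypothesis, threshold=0.9):
    # The superficial shortcut: high overlap is taken to mean "entailment".
    # The threshold value is an arbitrary assumption.
    return lexical_overlap(premise, hypothesis) >= threshold


# Illustrative pair: every hypothesis word appears in the premise,
# yet the premise does not entail the hypothesis (the roles are reversed).
premise = "The doctor was paid by the actor."
hypothesis = "The doctor paid the actor."
score = lexical_overlap(premise, hypothesis)
shortcut_says_entailment = overlap_heuristic_predicts_entailment(premise, hypothesis)
```

On this pair the overlap score is maximal, so the shortcut predicts entailment even though the correct label is non-entailment. A model that leans on the heuristic would be expected to fail on inputs of this kind.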
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
