Linguistic Analysis of Pretrained Sentence Encoders with Acceptability Judgments
Alex Warstadt, Samuel R. Bowman

TL;DR
This paper introduces a broad-coverage analysis dataset for probing the grammatical knowledge of pretrained sentence encoders, showing where they handle complex syntax well and where they fall short.
Contribution
It annotates the CoLA development set for 13 classes of syntactic phenomena and uses the result to evaluate three encoders (BERT, GPT, and a BiLSTM baseline), highlighting their capabilities and shortcomings in grammatical understanding.
Findings
Models handle complex or non-canonical argument structures, such as passives and ditransitives, well.
Long-distance dependencies remain challenging for all models.
BERT and GPT outperform the BiLSTM baseline on complex syntactic tasks.
Abstract
Recent work on evaluating grammatical knowledge in pretrained sentence encoders gives a fine-grained view of a small number of phenomena. We introduce a new analysis dataset that also has broad coverage of linguistic phenomena. We annotate the development set of the Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2018) for the presence of 13 classes of syntactic phenomena including various forms of argument alternations, movement, and modification. We use this analysis set to investigate the grammatical knowledge of three pretrained encoders: BERT (Devlin et al., 2018), GPT (Radford et al., 2018), and the BiLSTM baseline from Warstadt et al. We find that these models have a strong command of complex or non-canonical argument structures like ditransitives (Sue gave Dan a book) and passives (The book was read). Sentences with long distance dependencies like questions (What do…
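The per-phenomenon analysis described in the abstract can be sketched roughly as follows: group the annotated sentences by phenomenon tag and score a model's acceptability predictions within each group. This is only an illustrative sketch, not the authors' code. The Matthews correlation coefficient is assumed as the metric because the summary above does not name one, and the tag names, the `mcc_by_phenomenon` helper, and the toy data are all hypothetical.

```python
from collections import defaultdict
from math import sqrt


def mcc(gold, pred):
    """Matthews correlation coefficient for binary labels (1 = acceptable)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when the score is undefined (one class is missing entirely).
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def mcc_by_phenomenon(examples):
    """Score predictions separately for each phenomenon tag.

    Each example is (tags, gold_label, predicted_label). A sentence may
    carry several tags, in which case it counts toward every one of them.
    """
    groups = defaultdict(lambda: ([], []))
    for tags, gold, pred in examples:
        for tag in tags:
            groups[tag][0].append(gold)
            groups[tag][1].append(pred)
    return {tag: mcc(g, p) for tag, (g, p) in groups.items()}


# Toy data invented for illustration; tags and labels are hypothetical.
examples = [
    ({"passive"}, 1, 1),
    ({"passive"}, 0, 0),
    ({"question"}, 1, 0),
    ({"question"}, 0, 0),
    ({"question", "passive"}, 1, 1),
]
scores = mcc_by_phenomenon(examples)
for tag in sorted(scores):
    print(tag, round(scores[tag], 3))
```

Because a sentence can carry more than one tag, the groups overlap, so per-phenomenon scores computed this way are not independent of one another.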
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Hate Speech and Cyberbullying Detection
Methods: Linear Layer · Cosine Annealing · Sigmoid Activation · Tanh Activation · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Linear Warmup With Cosine Annealing · Byte Pair Encoding · Dense Connections
