Probing for the Usage of Grammatical Number
Karim Lasri, Tiago Pimentel, Alessandro Lenci, Thierry Poibeau, Ryan Cotterell

TL;DR
This paper introduces a usage-based probing method to determine whether pre-trained models like BERT genuinely rely on specific linguistic properties, such as grammatical number, for their predictions.
Contribution
It proposes a novel probing setup that assesses the actual usage of linguistic encodings by intervening on model representations and measuring performance impact.
Findings
BERT relies on a linear encoding of grammatical number.
BERT uses separate encodings for nouns and verbs.
Information transfer of grammatical number occurs in specific layers.
Abstract
A central quest of probing is to uncover how pre-trained models encode a linguistic property within their representations. An encoding, however, might be spurious, i.e., the model might not rely on it when making predictions. In this paper, we try to find encodings that the model actually uses, introducing a usage-based probing setup. We first choose a behavioral task which cannot be solved without using the linguistic property. Then, we attempt to remove the property by intervening on the model's representations. We contend that, if an encoding is used by the model, its removal should harm the performance on the chosen behavioral task. As a case study, we focus on how BERT encodes grammatical number, and on how it uses this encoding to solve the number agreement task. Experimentally, we find that BERT relies on a linear encoding of grammatical number to produce the correct behavioral…
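The usage-based logic in the abstract (remove an encoding, then check whether the behavioral task suffers) can be illustrated with a small toy sketch. Everything below is an assumption made for illustration: the synthetic "representations", the fixed `readout` vector standing in for the model's downstream behavior, and the choice of a class-mean-difference direction as the linear probe. The text above does not say which intervention the authors use, so this should not be read as their method.

```python
import numpy as np


def make_toy_data(n=2000, d=16, seed=0):
    """Synthetic stand-in for hidden states: number (+1/-1) is encoded along axis 0."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)      # hypothetical singular/plural labels
    X = rng.normal(size=(n, d))          # isotropic noise in every dimension
    X[:, 0] += 3.0 * y                   # linear encoding of the property
    return X, y


def mean_difference_direction(X, y):
    """A very simple linear probe: unit vector between the two class means."""
    w = X[y == 1].mean(axis=0) - X[y == -1].mean(axis=0)
    return w / np.linalg.norm(w)


def remove_direction(X, w):
    """Intervention: project each row onto the orthogonal complement of unit vector w."""
    return X - np.outer(X @ w, w)


def accuracy(X, y, readout):
    """Behavioral proxy: a fixed linear readout must recover the label's sign."""
    return float(np.mean(np.sign(X @ readout) == y))


X, y = make_toy_data()

# Hypothetical downstream computation that relies on axis 0 to solve the task.
readout = np.zeros(X.shape[1])
readout[0] = 1.0

w = mean_difference_direction(X, y)
acc_before = accuracy(X, y, readout)
acc_after = accuracy(remove_direction(X, w), y, readout)

print(f"task accuracy before removal: {acc_before:.3f}")
print(f"task accuracy after removal:  {acc_after:.3f}")
```

In this toy setting the readout depends on the same direction the probe finds, so erasing that direction should push task accuracy toward chance. If the readout instead relied on a different subspace, the same intervention would leave accuracy largely intact, which is the sense in which an encoding could be present but unused.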
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Dense Connections · Attention Dropout · Softmax · Dropout · Layer Normalization
