Compositional and Lexical Semantics in RoBERTa, BERT and DistilBERT: A Case Study on CoQA
Ieva Staliūnaitė, Ignacio Iacobacci

TL;DR
This study analyzes how well RoBERTa, BERT, and DistilBERT capture linguistic phenomena in conversational question answering, revealing their strengths and weaknesses in representing compositional and lexical semantics.
Contribution
It systematically identifies linguistic knowledge gaps in these models and demonstrates performance improvements through multitask learning and ensembling techniques.
Findings
Ensembles of the enhanced models improve overall F1 by 2.2 to 2.7 points.
The same ensembles boost F1 by up to 42.1 points on the hardest question classes.
Differences in linguistic capabilities are observed among RoBERTa, BERT, and DistilBERT.
Abstract
Many NLP tasks have benefited from transferring knowledge from contextualized word embeddings; however, the picture of what type of knowledge is transferred is incomplete. This paper studies the types of linguistic phenomena accounted for by language models in the context of a Conversational Question Answering (CoQA) task. Through systematic error analysis, we identify the problematic areas for the fine-tuned RoBERTa, BERT and DistilBERT models: basic arithmetic (counting phrases), compositional semantics (negation and Semantic Role Labeling), and lexical semantics (surprisal and antonymy). When enhanced with the relevant linguistic knowledge through multitask learning, the models improve in performance. Ensembles of the enhanced models yield a boost between 2.2 and 2.7 points in F1 score overall, and up to 42.1 points in F1 on the hardest question classes. The results show differences in…
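The F1 figures quoted above are answer-overlap scores. As a rough illustration of what a point of F1 measures in question answering, here is a minimal sketch of a token-overlap F1 between a predicted answer and a single reference answer. The function name `token_f1` and the lowercase, whitespace-split tokenization are assumptions made for this example; official evaluation scripts usually normalize answers further and may compare against several references, which is omitted here.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between one predicted and one reference answer.

    Hypothetical helper for illustration only; not the paper's evaluation code.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either answer is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Two of three predicted tokens match a two-token reference.
    print(token_f1("the red car", "red car"))
```

Scores on this 0-to-1 scale are conventionally reported multiplied by 100, so a "2.2 point" gain corresponds to an average increase of 0.022 per question under this kind of metric.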
Taxonomy
Methods: Linear Layer · Softmax · Layer Normalization · Weight Decay · Dropout · Linear Warmup With Linear Decay · RoBERTa · Dense Connections · Attention Dropout · WordPiece
