Compositional and Lexical Semantics in RoBERTa, BERT and DistilBERT: A Case Study on CoQA

Ieva Staliūnaitė, Ignacio Iacobacci

arXiv:2009.08257·cs.CL·September 18, 2020

TL;DR

This study analyzes how well RoBERTa, BERT, and DistilBERT capture linguistic phenomena in conversational question answering, revealing their strengths and weaknesses in representing compositional and lexical semantics.

Contribution

It systematically identifies linguistic knowledge gaps in these models and demonstrates performance improvements through multitask learning and ensembling techniques.

Findings

01

Ensembles of the linguistically enhanced models improve overall F1 by 2.2 to 2.7 points.

02

On the hardest question classes, the ensembles improve F1 by up to 42.1 points.

03

RoBERTa, BERT, and DistilBERT differ in their linguistic capabilities.

Abstract

Many NLP tasks have benefited from transferring knowledge from contextualized word embeddings; however, the picture of what type of knowledge is transferred is incomplete. This paper studies the types of linguistic phenomena accounted for by language models in the context of a Conversational Question Answering (CoQA) task. Through systematic error analysis, we identify the problematic areas for the finetuned RoBERTa, BERT and DistilBERT models: basic arithmetic (counting phrases), compositional semantics (negation and Semantic Role Labeling), and lexical semantics (surprisal and antonymy). When enhanced with the relevant linguistic knowledge through multitask learning, the models improve in performance. Ensembles of the enhanced models yield a boost between 2.2 and 2.7 points in F1 score overall, and up to 42.1 points in F1 on the hardest question classes. The results show differences in…
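
The F1 figures quoted above are answer-level scores. As a rough illustration only, the sketch below assumes a SQuAD-style token-overlap F1 between a predicted answer and a single reference answer. The function name `token_f1` is hypothetical, and the official CoQA scorer applies further answer normalization and handles multiple references, both of which this sketch omits.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Simplified sketch: lowercases and splits on whitespace only.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # A counting-style question where the prediction contains an extra token.
    print(round(token_f1("three cats", "three"), 3))
```

Under this definition, a gain of 2.2 points corresponds to a rise of 0.022 in the mean per-question score when F1 is reported on a 0 to 100 scale.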

Taxonomy

Methods: Linear Layer · Softmax · Layer Normalization · Weight Decay · Dropout · Linear Warmup With Linear Decay · RoBERTa · Dense Connections · Attention Dropout · WordPiece