# What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models

**Authors:** Allyson Ettinger

arXiv: 1907.13528 · 2020-07-14

## TL;DR

This paper introduces a suite of psycholinguistic diagnostics for probing which linguistic capacities language models such as BERT possess. Applied to BERT, they show robust noun hypernym retrieval but weak performance on challenging inferences and role-based event prediction, along with clear insensitivity to negation.

## Contribution

It presents a novel suite of diagnostics, drawn from human language experiments, that asks targeted questions about the information language models such as BERT use when generating predictions in context.

## Key findings

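- BERT can generally distinguish good from bad completions involving shared category or role reversal, though with less sensitivity than humans.
- BERT robustly retrieves noun hypernyms.
- BERT struggles with challenging inferences and role-based event prediction, and it is clearly insensitive to the contextual impact of negation.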
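The negation finding can be made concrete as a check on whether inserting "not" changes which completion a model prefers. Below is a minimal sketch, assuming each item's probabilities have already been read off a masked language model at the `[MASK]` position (that extraction step is omitted). The class `NegationItem`, the function `negation_flip_rate`, the example frames, and the toy numbers are all hypothetical illustrations, not the paper's code, items, or results.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class NegationItem:
    """Hypothetical minimal pair scored by a masked language model.

    The same two nouns are scored in an affirmative frame
    ("A robin is a [MASK].") and a negated frame ("A robin is not a [MASK].").
    """
    p_true_aff: float   # P(category noun | affirmative frame), e.g. "bird"
    p_other_aff: float  # P(other noun | affirmative frame), e.g. "tree"
    p_true_neg: float   # P(category noun | negated frame)
    p_other_neg: float  # P(other noun | negated frame)


def flips_under_negation(item: NegationItem) -> bool:
    """True if the model favors the category noun in the affirmative frame
    and switches to the other noun once "not" is inserted."""
    return item.p_true_aff > item.p_other_aff and item.p_other_neg > item.p_true_neg


def negation_flip_rate(items: Sequence[NegationItem]) -> float:
    """Fraction of items whose preference flips. A model that ignores
    negation keeps the same preference in both frames and scores near 0."""
    if not items:
        raise ValueError("need at least one item")
    return sum(flips_under_negation(it) for it in items) / len(items)


# Toy, made-up probabilities for illustration only (not results from the paper).
items = [
    NegationItem(p_true_aff=0.60, p_other_aff=0.01, p_true_neg=0.50, p_other_neg=0.02),
    NegationItem(p_true_aff=0.40, p_other_aff=0.05, p_true_neg=0.03, p_other_neg=0.20),
]
print(negation_flip_rate(items))  # 0.5
```

In the first toy item the model keeps preferring the category noun even after "not" is added, which is the kind of insensitivity the key findings describe; only the second item flips.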

## Abstract

Pre-training by language modeling has become a popular and successful approach to NLP tasks, but we have yet to understand exactly what linguistic capacities these pre-training processes confer upon models. In this paper we introduce a suite of diagnostics drawn from human language experiments, which allow us to ask targeted questions about the information used by language models for generating predictions in context. As a case study, we apply these diagnostics to the popular BERT model, finding that it can generally distinguish good from bad completions involving shared category or role reversal, albeit with less sensitivity than humans, and it robustly retrieves noun hypernyms, but it struggles with challenging inferences and role-based event prediction -- and in particular, it shows clear insensitivity to the contextual impacts of negation.
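Diagnostics of this kind compare a model's predictions at a masked position in a controlled context. Below is a minimal sketch of two such measures: a pairwise sensitivity score (does the good completion get higher probability than the bad one?) and top-k word-prediction accuracy (is the expected word among the model's k best guesses?). It assumes probabilities and ranked predictions have already been obtained from a masked language model such as BERT. The function names, data layout, and toy values are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def sensitivity(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of items where the model gives the expected (good) completion
    a higher probability than the unexpected (bad) one in the same context."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(p_good > p_bad for p_good, p_bad in pairs) / len(pairs)


def top_k_accuracy(ranked: Sequence[Sequence[str]], targets: Sequence[str], k: int) -> float:
    """Fraction of items whose expected word appears among the model's top-k predictions."""
    if not targets or len(ranked) != len(targets):
        raise ValueError("ranked and targets must be non-empty and the same length")
    hits = sum(target in preds[:k] for preds, target in zip(ranked, targets))
    return hits / len(targets)


# Toy, made-up values for illustration only.
pairs = [(0.30, 0.02), (0.05, 0.10), (0.12, 0.01)]
ranked = [["bird", "animal", "tree"], ["fish", "animal"], ["tool", "weapon"]]
targets = ["bird", "animal", "weapon"]

print(sensitivity(pairs))                    # 2 of 3 pairs favor the good completion
print(top_k_accuracy(ranked, targets, k=1))  # 1 of 3 targets ranked first
print(top_k_accuracy(ranked, targets, k=2))  # 1.0
```

The two measures answer different questions. Sensitivity asks whether the model orders a good and a bad completion correctly, even when neither is its top guess. Top-k accuracy asks whether the expected word is actually among the model's most likely predictions.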

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.13528/full.md

## References

37 references; the full list is in the complete paper: https://tomesphere.com/paper/1907.13528/full.md

---
Source: https://tomesphere.com/paper/1907.13528