Masked Language Model Scoring

Julian Salazar; Davis Liang; Toan Q. Nguyen; Katrin Kirchhoff

arXiv:1910.14659 · cs.CL · January 5, 2021

TL;DR

This paper shows that pseudo-log-likelihood scores (PLLs) from pretrained masked language models can be used out of the box, without finetuning, to rescore hypotheses in tasks like speech recognition and translation.

Contribution

The paper proposes PLLs from MLMs as an out-of-the-box scoring method that outperforms scores from autoregressive language models like GPT-2 and can be applied plug-and-play across multiple languages.

Findings

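01 PLLs outperform scores from autoregressive language models like GPT-2 in a variety of tasks.

02 Rescoring ASR and NMT hypotheses with RoBERTa reduces an end-to-end LibriSpeech model's WER by 30% relative and adds up to +1.7 BLEU on low-resource translation pairs.

03 PLLs express linguistic acceptability without supervision and without a left-to-right bias.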
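
The hypothesis rescoring behind the WER and BLEU gains can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each n-best hypothesis carries a log score from the base ASR or NMT model plus a precomputed PLL, and that the two are combined as a weighted sum. The function name `rescore_nbest`, the weight value, and all scores below are hypothetical; in practice the weight would be tuned on development data.

```python
from typing import List, Tuple

# (text, base model log score, PLL from the masked language model)
Hypothesis = Tuple[str, float, float]


def rescore_nbest(hypotheses: List[Hypothesis], weight: float) -> List[Tuple[str, float]]:
    """Rank hypotheses by base_score + weight * pll, best first.

    Assumption: a weighted sum of log scores; `weight` is a hypothetical
    hyperparameter, not a value taken from the paper.
    """
    combined = [(text, base + weight * pll) for text, base, pll in hypotheses]
    return sorted(combined, key=lambda item: item[1], reverse=True)


# Made-up scores for illustration only.
hypotheses = [
    ("i scream for ice cream", -4.0, -12.0),
    ("ice cream for ice cream", -3.8, -20.0),
]

best_without_pll = rescore_nbest(hypotheses, weight=0.0)[0][0]
best_with_pll = rescore_nbest(hypotheses, weight=0.5)[0][0]
print(best_without_pll)
print(best_with_pll)
```

With a weight of zero the base model's top hypothesis is kept; with a positive weight the PLL can promote a hypothesis that the base model ranked lower.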

Abstract

Pretrained masked language models (MLMs) require finetuning for most NLP tasks. Instead, we evaluate MLMs out of the box via their pseudo-log-likelihood scores (PLLs), which are computed by masking tokens one by one. We show that PLLs outperform scores from autoregressive language models like GPT-2 in a variety of tasks. By rescoring ASR and NMT hypotheses, RoBERTa reduces an end-to-end LibriSpeech model's WER by 30% relative and adds up to +1.7 BLEU on state-of-the-art baselines for low-resource translation pairs, with further gains from domain adaptation. We attribute this success to PLL's unsupervised expression of linguistic acceptability without a left-to-right bias, greatly improving on scores from GPT-2 (+10 points on island effects, NPI licensing in BLiMP). One can finetune MLMs to give scores without masking, enabling computation in a single inference pass. In all, PLLs and…
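
The "masking tokens one by one" computation can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the paper's implementation: `predict` is a hypothetical stand-in for a real MLM that returns a probability for every vocabulary item at the masked position, and `toy_predict` is a hand-written toy that looks at both the left and the right neighbor so the example runs without any model download.

```python
import math
from typing import Callable, Dict, List, Sequence

MASK = "[MASK]"

# Hypothetical interface: (sequence with one masked token, masked index) -> P(token).
MaskedPredictor = Callable[[Sequence[str], int], Dict[str, float]]


def pseudo_log_likelihood(tokens: Sequence[str], predict: MaskedPredictor) -> float:
    """Sum of log P(token_t | all other tokens), masking one position at a time."""
    total = 0.0
    for t, target in enumerate(tokens):
        masked: List[str] = list(tokens)
        masked[t] = MASK  # hide only position t; the rest stays visible
        probs = predict(masked, t)
        total += math.log(probs[target])
    return total


def toy_predict(masked: Sequence[str], position: int) -> Dict[str, float]:
    """Toy bidirectional predictor (assumption, not a trained model)."""
    vocab = ["the", "cat", "sat", "mat"]
    left = masked[position - 1] if position > 0 else None
    right = masked[position + 1] if position + 1 < len(masked) else None
    weights = {word: 1.0 for word in vocab}
    if left == "the":
        weights["cat"] += 2.0  # "the cat" is favored via left context
    if right == "sat":
        weights["cat"] += 2.0  # "cat sat" is favored via right context
    norm = sum(weights.values())
    return {word: value / norm for word, value in weights.items()}


fluent = pseudo_log_likelihood(["the", "cat", "sat"], toy_predict)
scrambled = pseudo_log_likelihood(["sat", "cat", "the"], toy_predict)
print(fluent > scrambled)  # True
```

Each position requires its own forward pass, so a sentence of n tokens costs n passes with a real MLM. That cost is what the single-pass finetuned variant mentioned in the abstract avoids.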

Taxonomy

Methods: Linear Layer · Cosine Annealing · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning · GPT-2 · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Byte Pair Encoding · Weight Decay