On Losses for Modern Language Models

Stephane Aroca-Ouellette; Frank Rudzicz

arXiv:2010.01694 · cs.CL · October 6, 2020

TL;DR

This paper critically examines BERT's pre-training tasks, especially next sentence prediction (NSP), explores additional auxiliary tasks, and shows that multi-task pre-training improves performance, outperforming BERT on GLUE with less training data.

Contribution

The paper clarifies NSP's negative effect on pre-training, explores fourteen auxiliary pre-training tasks, seven of which are novel to modern language models, and demonstrates that multi-task pre-training improves language model performance.

Findings

01

NSP is detrimental to training due to its context splitting and shallow semantic signal.

02

Six auxiliary pre-training tasks outperform a pure MLM baseline.

03

Multi-task pre-training yields better results than pre-training on any single auxiliary task.

Abstract

BERT set many state-of-the-art results over varied NLU benchmarks by pre-training over two tasks: masked language modelling (MLM) and next sentence prediction (NSP), the latter of which has been highly criticized. In this paper, we 1) clarify NSP's effect on BERT pre-training, 2) explore fourteen possible auxiliary pre-training tasks, of which seven are novel to modern language models, and 3) investigate different ways to include multiple tasks into pre-training. We show that NSP is detrimental to training due to its context splitting and shallow semantic signal. We also identify six auxiliary pre-training tasks -- sentence ordering, adjacent sentence prediction, TF prediction, TF-IDF prediction, a FastSent variant, and a Quick Thoughts variant -- that outperform a pure MLM baseline. Finally, we demonstrate that using multiple tasks in a multi-task pre-training framework provides better…
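The TF and TF-IDF prediction tasks named in the abstract can be pictured as regression targets computed from token counts. The sketch below is only an illustration of that idea: the function names, the toy corpus, and the smoothed IDF formula are assumptions made here, not the paper's definitions or the code in the official repository.

```python
import math
from collections import Counter


def tf_targets(tokens):
    """Relative frequency of each distinct token within one context."""
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: n / total for tok, n in counts.items()}


def idf_table(contexts):
    """Smoothed inverse document frequency over tokenized contexts.

    Assumed formula: ln((1 + N) / (1 + df)) + 1, chosen for illustration.
    """
    n_docs = len(contexts)
    # Count each token once per context it appears in.
    doc_freq = Counter(tok for ctx in contexts for tok in set(ctx))
    return {
        tok: math.log((1 + n_docs) / (1 + df)) + 1.0
        for tok, df in doc_freq.items()
    }


def tfidf_targets(tokens, idf):
    """Per-token TF-IDF values that a model could be asked to predict."""
    tf = tf_targets(tokens)
    return {tok: tf[tok] * idf[tok] for tok in tf}


# Hypothetical toy corpus; each context is a whitespace-tokenized sentence.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "a bird sang".split(),
]

idf = idf_table(corpus)
tf = tf_targets(corpus[0])
tfidf = tfidf_targets(corpus[0], idf)

for tok in sorted(tfidf, key=tfidf.get, reverse=True):
    print(f"{tok:>4}  tf={tf[tok]:.3f}  tfidf={tfidf[tok]:.3f}")
```

In a pre-training setup, values like these would serve as targets for an auxiliary prediction head trained alongside MLM; how the paper weights or combines such losses is not specified in the text above.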

Code & Models

Repositories

StephAO/olfmlm (PyTorch, official)

Taxonomy

Methods: Linear Layer · Dense Connections · Layer Normalization · WordPiece · Multi-Head Attention · Dropout · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Attention Is All You Need