On the Role of Bidirectionality in Language Model Pre-Training

Mikel Artetxe; Jingfei Du; Naman Goyal; Luke Zettlemoyer; Ves Stoyanov

arXiv:2205.11726 · cs.CL · October 27, 2022 · 1 citation

Open Access

TL;DR

This paper investigates the role of bidirectionality in language model pre-training. It proposes a unified framework for comparing architectures and evaluates them on next token prediction, text infilling, zero-shot priming and fine-tuning.

Contribution

The paper introduces a generalized framework that controls bidirectional context and bidirectional attention separately, and uses it to show how each affects model performance and suitability for a given application.

Findings

01. Bidirectional attention benefits fine-tuning and infilling.

02. Unidirectional models excel at next token prediction and zero-shot priming.

03. The optimal bidirectionality configuration depends on the specific application.

Abstract

Prior work on language model pre-training has explored different architectures and learning objectives, but differences in data, hyperparameters and evaluation make a principled comparison difficult. In this work, we focus on bidirectionality as a key factor that differentiates existing approaches, and present a comprehensive study of its role in next token prediction, text infilling, zero-shot priming and fine-tuning. We propose a new framework that generalizes prior approaches, including fully unidirectional models like GPT, fully bidirectional models like BERT, and hybrid models like CM3 and prefix LM. Our framework distinguishes between two notions of bidirectionality (bidirectional context and bidirectional attention) and allows us to control each of them separately. We find that the optimal configuration is largely application-dependent (e.g., bidirectional attention is beneficial…
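The spectrum the abstract describes, from fully unidirectional models to fully bidirectional ones with prefix LM in between, can be illustrated with attention masks. The following is a minimal sketch, not the authors' implementation: the function name `attention_mask` and the parameter `prefix_len` are hypothetical names chosen for illustration, and the sketch covers only bidirectional attention, not bidirectional context.

```python
def attention_mask(seq_len, prefix_len):
    """Return mask[i][j] == True when position i may attend to position j.

    prefix_len is an illustrative parameter (an assumption, not taken from
    the paper): the first prefix_len positions attend to each other in both
    directions, and every later position attends only to itself and earlier
    positions.
    """
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must be between 0 and seq_len")
    return [
        [j <= i or (i < prefix_len and j < prefix_len) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def render(mask):
    """Draw a mask as text: 'x' = may attend, '.' = blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


causal = attention_mask(4, 0)  # fully unidirectional (GPT-style causal mask)
prefix = attention_mask(4, 2)  # hybrid: bidirectional over a 2-token prefix
full = attention_mask(4, 4)    # fully bidirectional (BERT-style attention)

if __name__ == "__main__":
    for name, mask in [("causal", causal), ("prefix", prefix), ("full", full)]:
        print(name)
        print(render(mask))
        print()
```

Setting `prefix_len` to 0 or to `seq_len` recovers the two extremes, so a single parameter is enough to move between the attention patterns in this simplified picture.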

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Linear Warmup With Linear Decay · Dense Connections · Dropout · Cosine Annealing · WordPiece · Discriminative Fine-Tuning