On the Role of Bidirectionality in Language Model Pre-Training
Mikel Artetxe, Jingfei Du, Naman Goyal, Luke Zettlemoyer, Ves Stoyanov

TL;DR
This paper studies the role of bidirectionality in language model pre-training. It proposes a unified framework for comparing unidirectional, bidirectional, and hybrid architectures, and evaluates them on next token prediction, text infilling, zero-shot priming, and fine-tuning.
Contribution
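The paper introduces a framework that generalizes fully unidirectional, fully bidirectional, and hybrid models, and that controls bidirectional context and bidirectional attention separately, which makes it possible to isolate how each one affects performance and suitability for a given application.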
Findings
Bidirectional attention benefits fine-tuning and infilling.
Unidirectional models excel in next token prediction and zero-shot tasks.
The optimal bidirectionality configuration depends on the specific application.
Abstract
Prior work on language model pre-training has explored different architectures and learning objectives, but differences in data, hyperparameters and evaluation make a principled comparison difficult. In this work, we focus on bidirectionality as a key factor that differentiates existing approaches, and present a comprehensive study of its role in next token prediction, text infilling, zero-shot priming and fine-tuning. We propose a new framework that generalizes prior approaches, including fully unidirectional models like GPT, fully bidirectional models like BERT, and hybrid models like CM3 and prefix LM. Our framework distinguishes between two notions of bidirectionality (bidirectional context and bidirectional attention) and allows us to control each of them separately. We find that the optimal configuration is largely application-dependent (e.g., bidirectional attention is beneficial…
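To make the notion of bidirectional attention more concrete, the following is a minimal sketch of how a single attention mask can interpolate between a fully causal (GPT-style) pattern, a prefix LM pattern, and a fully bidirectional pattern. It is an illustration only, not the paper's implementation: the function names and the `n_bidir` parameter are hypothetical, and the sketch covers only the attention side, not the separate notion of bidirectional context.

```python
def attention_mask(seq_len, n_bidir):
    """Build a boolean attention mask (hypothetical helper for illustration).

    mask[i][j] is True when position i may attend to position j.
    The first n_bidir positions attend to each other in both directions;
    every other position attends only to itself and to earlier positions.
    """
    if not 0 <= n_bidir <= seq_len:
        raise ValueError("n_bidir must be between 0 and seq_len")
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i                              # standard left-to-right attention
            within_prefix = i < n_bidir and j < n_bidir  # bidirectional block over the prefix
            row.append(causal or within_prefix)
        mask.append(row)
    return mask


def render(mask):
    """Render a mask as text: 'x' where attention is allowed, '.' where it is blocked."""
    return "\n".join("".join("x" if allowed else "." for allowed in row) for row in mask)


# n_bidir = 0        -> fully causal mask (unidirectional attention)
# 0 < n_bidir < len  -> prefix-LM-style mask
# n_bidir = seq_len  -> fully bidirectional mask
print(render(attention_mask(4, 2)))
# xx..
# xx..
# xxx.
# xxxx
```

In this sketch, one integer selects the attention pattern, which mirrors the idea of treating unidirectional, hybrid, and bidirectional models as points in one family rather than as separate architectures.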
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Linear Warmup With Linear Decay · Dense Connections · Dropout · Cosine Annealing · WordPiece · Discriminative Fine-Tuning
