Learning from Mistakes: Using Mis-predictions as Harm Alerts in Language Pre-Training
Chen Xing, Wenhao Liu, Caiming Xiong

TL;DR
This paper introduces MPA, a novel pre-training method that uses mis-predictions to identify and reduce reliance on dominating co-occurrence patterns, thereby enhancing language understanding and downstream task performance.
Contribution
The paper proposes MPA, a pre-training approach that uses mis-predictions to guide attention and reduce reliance on simple co-occurrence patterns, making pre-training more efficient.
Findings
MPA accelerates BERT and ELECTRA pre-training.
MPA improves downstream task performance.
MPA reduces reliance on dominating co-occurrence patterns.
Abstract
Fitting complex patterns in the training data, such as reasoning and commonsense, is a key challenge for language pre-training. According to recent studies and our empirical observations, one possible reason is that some easy-to-fit patterns in the training data, such as frequently co-occurring word combinations, dominate and harm pre-training, making it hard for the model to fit more complex information. We argue that mis-predictions can help locate such dominating patterns that harm language understanding. When a mis-prediction occurs, there should be frequently co-occurring patterns with the mis-predicted word fitted by the model that lead to the mis-prediction. If we can add regularization to train the model to rely less on such dominating patterns when a mis-prediction occurs and focus more on the rest more subtle patterns, more information can be efficiently fitted at…
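As a rough illustration of the intuition in the abstract, the toy sketch below flags a mis-prediction at a masked position and then ranks the context words by how often they co-occur with the mis-predicted word in a small corpus. It is not the paper's algorithm: the abstract does not specify how the regularization is applied, and every name here (`cooccurrence_counts`, `find_mispredictions`, `dominating_context`, the toy corpus) is a hypothetical choice made for illustration only.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count how often each unordered word pair appears in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        # sorted(set(...)) gives each pair once per sentence, in a canonical order
        for a, b in combinations(sorted(set(sentence)), 2):
            counts[(a, b)] += 1
    return counts


def find_mispredictions(predicted, target):
    """Return (position, predicted_word, target_word) wherever the two differ.

    Both arguments map a masked position to a word.
    """
    return [
        (pos, predicted[pos], target[pos])
        for pos in sorted(target)
        if predicted[pos] != target[pos]
    ]


def dominating_context(context, mispredicted_word, counts, top_k=2):
    """Rank context words by co-occurrence with the mis-predicted word.

    Words with a zero count are dropped; ties are broken alphabetically.
    """
    scored = []
    for word in context:
        if word == mispredicted_word:
            continue
        score = counts[tuple(sorted((word, mispredicted_word)))]
        if score > 0:
            scored.append((word, score))
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


# Hypothetical toy corpus in which "new" and "york" co-occur often.
corpus = [
    ["new", "york", "city"],
    ["new", "york", "times"],
    ["new", "york", "subway"],
    ["new", "car", "engine"],
]
counts = cooccurrence_counts(corpus)

# Masked sentence: "she bought a new [MASK] engine", with the mask at position 4.
context = ["she", "bought", "a", "new", "engine"]
target = {4: "car"}
predicted = {4: "york"}  # assumed model output, made up for this example

for pos, wrong, right in find_mispredictions(predicted, target):
    suspects = dominating_context(context, wrong, counts)
    print(pos, wrong, right, suspects)
```

In this toy setup the context word "new" is the one flagged, because it co-occurs with the mis-predicted "york" three times in the corpus. A regularizer in the spirit of the abstract would then push the model to rely less on such flagged words and more on the rest of the context; how that is done in MPA is not shown here.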
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · WordPiece · Linear Warmup With Linear Decay · Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Dropout · Weight Decay
