Better Pre-Training by Reducing Representation Confusion
Haojie Zhang, Mingfei Liang, Ruobing Xie, Zhenlong Sun, Bo Zhang, Leyu Lin

TL;DR
This paper identifies information confusion issues in Transformer-based language models and proposes two techniques, DDRP and MTH, to improve their ability to distinguish different information types, leading to better performance.
Contribution
The paper introduces DDRP encoding and MTH pre-training objectives to reduce representation confusion in pre-trained language models, enhancing their ability to capture diverse information.
Findings
Improved GLUE benchmark performance with proposed methods
Decoupling position features enhances semantic understanding
Regularizers increase diversity of token and head representations
Abstract
In this work, we revisit Transformer-based pre-trained language models and identify two types of information confusion, in position encoding and in model representations, respectively. First, we show that in relative position encoding, jointly modeling relative distances and directions confuses two heterogeneous kinds of information. This may leave the model unable to capture the associative semantics of positions at the same distance but in opposite directions, which in turn affects downstream task performance. Second, we notice that BERT with the Masked Language Modeling (MLM) pre-training objective outputs similar token representations (last hidden states of different tokens) and head representations (attention weights of different heads), which may limit the diversity of information expressed by different tokens and heads. Motivated by the above…
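The two kinds of confusion described in the abstract can be illustrated with a small NumPy sketch. It is not the paper's DDRP or MTH implementation. It only shows that a signed relative offset splits into a distance and a direction, and it gives one possible way to measure how similar a set of token representations is. The function names (`relative_offsets`, `decouple`, `mean_pairwise_cosine`) and the choice of cosine similarity are assumptions made for illustration.

```python
import numpy as np


def relative_offsets(seq_len):
    """Signed relative offsets: offsets[i, j] = j - i."""
    pos = np.arange(seq_len)
    return pos[None, :] - pos[:, None]


def decouple(offsets):
    """Split signed offsets into distance (|j - i|) and direction (sign of j - i).

    Hypothetical helper: a joint encoding indexes on the signed offset,
    so +3 and -3 are unrelated entries. Here they share distance 3 and
    differ only in direction.
    """
    distance = np.abs(offsets)
    direction = np.sign(offsets)
    return distance, direction


def mean_pairwise_cosine(reps):
    """Mean cosine similarity over all distinct pairs of rows in reps (n, d).

    Assumed diagnostic: values near 1 mean the rows (e.g. last hidden
    states of different tokens) are nearly parallel, i.e. low diversity.
    """
    normed = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    sim = normed @ normed.T
    n = reps.shape[0]
    off_diagonal = sim[~np.eye(n, dtype=bool)]
    return float(off_diagonal.mean())


offsets = relative_offsets(4)
distance, direction = decouple(offsets)
```

In this sketch, `offsets[0, 3]` and `offsets[3, 0]` are different values, while `distance[0, 3]` and `distance[3, 0]` are equal and only `direction` differs. The same similarity measure could be applied to flattened attention maps of different heads, again as an assumption rather than the paper's stated procedure.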
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Gated Linear Unit · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Dropout · Weight Decay · Softmax · Inverse Square Root Schedule · Linear Warmup With Linear Decay
