Better Pre-Training by Reducing Representation Confusion

Haojie Zhang; Mingfei Liang; Ruobing Xie; Zhenlong Sun; Bo Zhang; Leyu Lin

arXiv:2210.04246 · cs.CL · February 10, 2023 · 1 citation

Open Access

TL;DR

This paper identifies information confusion issues in Transformer-based language models and proposes two techniques, DDRP and MTH, to improve their ability to distinguish different information types, leading to better performance.

Contribution

The paper introduces DDRP encoding and MTH pre-training objectives to reduce representation confusion in pre-trained language models, enhancing their ability to capture diverse information.

Findings

01 The proposed methods improve performance on the GLUE benchmark.

02 Decoupling position features enhances semantic understanding.

03 The regularizers increase the diversity of token and head representations.
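One way to make "diversity of representations" concrete is to measure how similar a set of vectors is on average. The sketch below is an illustration only, not the metric or regularizer defined in the paper: it assumes diversity is approximated by mean pairwise cosine similarity, and the function names `cosine` and `mean_pairwise_cosine` are hypothetical. Lower values mean the vectors (for example, last hidden states of different tokens) point in more varied directions.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors.

    Assumption: this is a simple proxy for representation similarity,
    not the measure used by the authors. Requires at least two
    non-zero vectors.
    """
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy "token representations": near-parallel vectors score close to 1,
# orthogonal vectors score 0.
similar = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
diverse = [[1.0, 0.0], [0.0, 1.0]]
print(mean_pairwise_cosine(similar))
print(mean_pairwise_cosine(diverse))
```

The same function could be applied to flattened attention maps of different heads, under the same assumption that cosine similarity is an acceptable proxy.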

Abstract

In this work, we revisit Transformer-based pre-trained language models and identify two types of information confusion, one in position encoding and one in model representations. First, we show that in relative position encoding, jointly modeling relative distances and directions confuses two heterogeneous kinds of information. This may leave the model unable to capture the associative semantics of positions at the same distance but in opposite directions, which in turn affects performance on downstream tasks. Second, we observe that BERT pre-trained with the Masked Language Modeling (MLM) objective outputs similar token representations (last hidden states of different tokens) and similar head representations (attention weights of different heads), which may limit the diversity of information expressed by different tokens and heads. Motivated by the above…
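The first kind of confusion can be illustrated with a small sketch. A signed relative offset between a query position i and a key position j carries two pieces of information at once: how far apart the positions are and on which side the key lies. The code below is a minimal illustration of separating those two quantities, assuming the offset is defined as j - i; it is not the paper's DDRP formulation, and the helper names `decouple` and `decoupled_tables` are hypothetical.

```python
def decouple(offset):
    """Split a signed relative offset into (distance, direction).

    distance  : non-negative number of positions between the two tokens
    direction : -1 if the key is to the left, +1 if to the right, 0 if same
    """
    distance = abs(offset)
    direction = (offset > 0) - (offset < 0)
    return distance, direction


def decoupled_tables(seq_len):
    """Build separate distance and direction matrices for a sequence.

    Assumption: offset is j - i for query index i and key index j.
    """
    distances = []
    directions = []
    for i in range(seq_len):
        row = [decouple(j - i) for j in range(seq_len)]
        distances.append([d for d, _ in row])
        directions.append([s for _, s in row])
    return distances, directions


distances, directions = decoupled_tables(4)
for row in distances:
    print(row)
for row in directions:
    print(row)
```

In this sketch the distance matrix is symmetric while the direction matrix is antisymmetric, so positions at the same distance in opposite directions share a distance index and differ only in the direction index. A model could then look up separate embeddings for each matrix, although how the paper combines them is not stated in the text above.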

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: Gated Linear Unit · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Dropout · Weight Decay · Softmax · Inverse Square Root Schedule · Linear Warmup With Linear Decay