DecBERT: Enhancing the Language Understanding of BERT with Causal Attention Masks
Ziyang Luo, Yadong Xi, Jing Ma, Zhiwei Yang, Xiaoxi Mao, Changjie Fan, Rongsheng Zhang

TL;DR
DecBERT introduces causal attention masks to improve BERT's understanding of word order, achieving comparable or better performance on GLUE tasks without position embeddings and accelerating pre-training.
Contribution
The paper proposes DecBERT, a novel BERT variant using causal attention masks to enhance position encoding and pre-training efficiency.
Findings
Causal attention masks improve BERT's performance on language understanding tasks.
DecBERT without position embeddings performs comparably to standard BERT.
The modification accelerates pre-training and yields better results with the same resources.
Abstract
Since 2017, Transformer-based models have played critical roles in various downstream Natural Language Processing tasks. However, a common limitation of the attention mechanism used in the Transformer Encoder is that it cannot automatically capture word-order information, so explicit position embeddings generally have to be fed into the target model. In contrast, the Transformer Decoder with causal attention masks is naturally sensitive to word order. In this work, we focus on improving the position encoding ability of BERT with causal attention masks. Furthermore, we propose a new pre-trained language model, DecBERT, and evaluate it on the GLUE benchmark. Experimental results show that (1) the causal attention mask is effective for BERT on language understanding tasks; (2) our DecBERT model without position embeddings achieves comparable performance on the GLUE…
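The abstract's central claim, that unmasked self-attention is blind to word order while causally masked attention is not, can be checked on a toy example. The sketch below is not the DecBERT architecture: it assumes a single attention head with identity query/key/value projections and no position embeddings, and every name in it (`self_attention`, `perm`, and so on) is illustrative rather than taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, causal=False):
    """Toy single-head attention with identity projections (an assumption)."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    if causal:
        # Lower-triangular mask: position i may attend only to positions j <= i.
        mask = np.tril(np.ones((n, n), dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))        # 5 token vectors, no position embeddings
perm = np.array([4, 2, 0, 3, 1])   # a reordering of the same 5 tokens

full = self_attention(x)
full_perm = self_attention(x[perm])
causal = self_attention(x, causal=True)
causal_perm = self_attention(x[perm], causal=True)

# Unmasked: shuffling the tokens merely shuffles the output rows, so each
# token's representation carries no trace of where it stood in the sequence.
bidirectional_order_blind = np.allclose(full[perm], full_perm)

# Causal: each token sees only its predecessors, so the same token gets a
# different representation when its position changes.
causal_order_blind = np.allclose(causal[perm], causal_perm)

print("unmasked attention ignores order:", bidirectional_order_blind)
print("causal attention ignores order:  ", causal_order_blind)
```

In this toy setting the unmasked check holds and the causal one does not, which is the property the abstract relies on when it describes the Transformer Decoder as naturally sensitive to word order.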
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Adam · Residual Connection · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections
