Improving BERT with Hybrid Pooling Network and Drop Mask

Qian Chen; Wen Wang; Qinglin Zhang; Chong Deng; Ma Yukun; Siqi Zheng

arXiv:2307.07258 · cs.CL · July 17, 2023

Open Access

TL;DR

This paper introduces HybridBERT, combining self-attention and pooling networks for improved contextual encoding, and DropMask to mitigate pre-training and fine-tuning mismatch, resulting in better performance and efficiency.

Contribution

HybridBERT's novel architecture integrates pooling networks with self-attention, and DropMask addresses pre-training/fine-tuning mismatch, enhancing BERT's effectiveness and efficiency.

Findings

1. HybridBERT outperforms BERT in pre-training loss and on transfer tasks.

2. HybridBERT trains 8% faster and uses 13% less memory than BERT (both relative).

3. DropMask improves downstream task accuracy across masking rates.

Abstract

Transformer-based pre-trained language models, such as BERT, achieve great success in various natural language understanding tasks. Prior research found that BERT captures a rich hierarchy of linguistic information at different layers. However, the vanilla BERT uses the same self-attention mechanism for each layer to model the different contextual features. In this paper, we propose a HybridBERT model which combines self-attention and pooling networks to encode different contextual features in each layer. Additionally, we propose a simple DropMask method to address the mismatch between pre-training and fine-tuning caused by excessive use of special mask tokens during Masked Language Modeling pre-training. Experiments show that HybridBERT outperforms BERT in pre-training with lower loss, faster training speed (8% relative), lower memory cost (13% relative), and also in transfer learning…
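The idea of mixing self-attention with a pooling network inside one layer can be illustrated with a small sketch. The abstract does not say how HybridBERT combines the two branches or which pooling operators it uses, so everything below is an assumption for illustration only: the function names (`self_attention`, `local_mean_pool`, `hybrid_layer`), the local mean-pooling window, the identity projections, and the choice to sum the branches with a residual connection are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention.

    Assumption: identity query/key/value projections, to keep the
    sketch short. x is a list of token vectors (lists of floats).
    """
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


def local_mean_pool(x, window=3):
    """Mean over a centered window of tokens, truncated at sequence edges.

    Assumption: a local mean is used here as a stand-in for a pooling
    network; the paper's actual pooling operators may differ.
    """
    half = window // 2
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        span = x[max(0, i - half): min(n, i + half + 1)]
        out.append([sum(v[j] for v in span) / len(span) for j in range(d)])
    return out


def hybrid_layer(x, window=3):
    """Hypothetical hybrid layer: residual input + attention + pooling."""
    attn = self_attention(x)
    pool = local_mean_pool(x, window)
    return [
        [xi + ai + pi for xi, ai, pi in zip(x_row, a_row, p_row)]
        for x_row, a_row, p_row in zip(x, attn, pool)
    ]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in hybrid_layer(tokens):
        print(row)
```

In this sketch the attention branch mixes information from every position, while the pooling branch only summarizes a small neighborhood, which is one plausible reading of "encoding different contextual features in each layer".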

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Linear Warmup With Linear Decay · Residual Connection · Adam · Dense Connections · Dropout