Improving BERT with Hybrid Pooling Network and Drop Mask
Qian Chen, Wen Wang, Qinglin Zhang, Chong Deng, Ma Yukun, Siqi Zheng

TL;DR
This paper introduces HybridBERT, which combines self-attention and pooling networks for contextual encoding, and DropMask, which mitigates the mismatch between pre-training and fine-tuning. HybridBERT pre-trains with lower loss, faster training, and lower memory cost than BERT.
Contribution
HybridBERT integrates pooling networks with self-attention in each layer, and DropMask addresses the pre-training/fine-tuning mismatch caused by excessive use of special mask tokens during Masked Language Modeling pre-training.
Findings
HybridBERT achieves lower pre-training loss than BERT and outperforms it on transfer tasks.
HybridBERT trains 8% faster and uses 13% less memory than BERT (both relative).
DropMask improves downstream task accuracy across masking rates.
Abstract
Transformer-based pre-trained language models, such as BERT, achieve great success in various natural language understanding tasks. Prior research found that BERT captures a rich hierarchy of linguistic information at different layers. However, the vanilla BERT uses the same self-attention mechanism for each layer to model the different contextual features. In this paper, we propose a HybridBERT model which combines self-attention and pooling networks to encode different contextual features in each layer. Additionally, we propose a simple DropMask method to address the mismatch between pre-training and fine-tuning caused by excessive use of special mask tokens during Masked Language Modeling pre-training. Experiments show that HybridBERT outperforms BERT in pre-training with lower loss, faster training speed (8% relative), lower memory cost (13% relative), and also in transfer learning…
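To make the idea of mixing self-attention with pooling concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify how the two branches are combined, so the choice of local max-pooling, the window size, and summing the two branch outputs are all assumptions made for illustration. The function names (`self_attention`, `local_max_pool`, `hybrid_layer`) are hypothetical, and the attention branch omits learned projections and multiple heads.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product attention with no learned projections.

    x is a list of token vectors (lists of floats). Each output vector is a
    weighted average of all token vectors, so it captures global context.
    """
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            for j in range(n)
        ]
        weights = softmax(scores)
        out.append([sum(weights[j] * x[j][k] for j in range(n)) for k in range(d)])
    return out


def local_max_pool(x, window=3):
    """Element-wise max over a sliding window centered on each token.

    Assumption: a local max-pooling branch stands in for the paper's pooling
    network, whose exact design is not described in this summary.
    """
    n, d = len(x), len(x[0])
    half = window // 2
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append([max(x[j][k] for j in range(lo, hi)) for k in range(d)])
    return out


def hybrid_layer(x, window=3):
    """Combine the two branches by summation (an illustrative assumption)."""
    attn = self_attention(x)
    pool = local_max_pool(x, window)
    return [[a + p for a, p in zip(ra, rp)] for ra, rp in zip(attn, pool)]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = hybrid_layer(tokens)
```

In this sketch the pooling branch costs time linear in sequence length, while the attention branch is quadratic. That difference is one plausible source of the speed and memory savings the abstract reports, though the summary does not confirm the mechanism.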
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Linear Warmup With Linear Decay · Residual Connection · Adam · Dense Connections · Dropout
