BERMo: What can BERT learn from ELMo?
Sangamesh Kodge, Kaushik Roy

TL;DR
BERMo enhances BERT by integrating hierarchical surface, syntactic, and semantic features using a simple combination scheme, leading to better accuracy, faster convergence, and improved parameter efficiency with minimal added complexity.
Contribution
The paper introduces BERMo, a novel modification to BERT that incorporates hierarchical features via a linear combination scheme inspired by ELMo, improving gradient flow and representation power.
Findings
Up to 4.65% accuracy improvement on SentEval semantic tasks.
Converges 1.67x and 1.15x faster on MNLI and QQP, respectively.
Enables stable pruning for small datasets like SST-2.
Abstract
We propose BERMo, an architectural modification to BERT, which makes predictions based on a hierarchy of surface, syntactic and semantic language features. We use the linear combination scheme proposed in Embeddings from Language Models (ELMo) to combine the scaled internal representations from different network depths. Our approach has two-fold benefits: (1) improved gradient flow for the downstream task, as every layer has a direct connection to the gradients of the loss function, and (2) increased representative power, as the model no longer needs to copy the features learned in the shallower layers which are necessary for the downstream task. Further, our model has a negligible parameter overhead, as there is a single scalar parameter associated with each layer in the network. Experiments on the probing task from the SentEval dataset show that our model performs up to 4.65% better in accuracy…
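The combination scheme described in the abstract can be illustrated with a minimal sketch. It follows the ELMo formulation, in which one learnable scalar per layer is softmax-normalized and the weighted sum of layer outputs is multiplied by a global scale `gamma`. Whether BERMo keeps the softmax normalization and the `gamma` term exactly as ELMo does is an assumption here, and the function names, tensor shapes, and layer count are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()


def scalar_mix(layer_outputs, layer_scalars, gamma=1.0):
    """ELMo-style weighted sum of per-layer hidden states.

    layer_outputs: array of shape (num_layers, seq_len, hidden_dim)
    layer_scalars: array of shape (num_layers,), one unnormalized
                   learnable scalar per layer (assumed parameterization)
    gamma:         global scale factor, as in ELMo (assumed to carry over)
    """
    layer_outputs = np.asarray(layer_outputs, dtype=float)
    weights = softmax(layer_scalars)  # non-negative, sums to 1
    # Contract the layer axis: result has shape (seq_len, hidden_dim).
    return gamma * np.tensordot(weights, layer_outputs, axes=1)


# Hypothetical example: 13 representations (embeddings + 12 encoder
# layers), 4 tokens, hidden size 8.
rng = np.random.default_rng(0)
layers = rng.normal(size=(13, 4, 8))
mixed = scalar_mix(layers, np.zeros(13))  # equal scalars -> plain average
```

Because every layer output enters the final representation directly, the loss gradient reaches each layer without passing through all the layers above it, which is the gradient-flow benefit the abstract refers to. The only added parameters in this sketch are the per-layer scalars and `gamma`.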
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Pruning · Linear Layer · Layer Normalization · Dense Connections · Residual Connection · Adam · Multi-Head Attention · Linear Warmup With Linear Decay
