BERMo: What can BERT learn from ELMo?

Sangamesh Kodge; Kaushik Roy

arXiv:2110.15802 · cs.CL · November 1, 2021

TL;DR

BERMo enhances BERT by integrating hierarchical surface, syntactic, and semantic features using a simple combination scheme, leading to better accuracy, faster convergence, and improved parameter efficiency with minimal added complexity.

Contribution

The paper introduces BERMo, a novel modification to BERT that incorporates hierarchical features via a linear combination scheme inspired by ELMo, improving gradient flow and representation power.

Findings

01. Up to 4.65% higher accuracy on SentEval semantic tasks.

02. Converges 1.67x and 1.15x faster on MNLI and QQP, respectively.

03. Enables stable pruning on small datasets such as SST-2.

Abstract

We propose BERMo, an architectural modification to BERT, which makes predictions based on a hierarchy of surface, syntactic and semantic language features. We use the linear combination scheme proposed in Embeddings from Language Models (ELMo) to combine the scaled internal representations from different network depths. Our approach has two-fold benefits: (1) improved gradient flow for the downstream task, as every layer has a direct connection to the gradients of the loss function, and (2) increased representative power, as the model no longer needs to copy the features learned in the shallower layers that are necessary for the downstream task. Further, our model has a negligible parameter overhead, as there is a single scalar parameter associated with each layer in the network. Experiments on the probing task from the SentEval dataset show that our model performs up to 4.65% better in accuracy…
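The linear combination scheme mentioned in the abstract can be illustrated with a small sketch. It assumes the ELMo-style form, in which the combined representation for a token is gamma * sum_j s_j * h_j, where h_j is the output of layer j, s = softmax(w) are normalized per-layer scalar weights, and gamma is a global scale. The function names (`softmax`, `scalar_mix`) and the parameter names (`layer_weights`, `gamma`) are hypothetical and chosen for illustration; whether BERMo normalizes its weights in exactly this way is an assumption, not something the abstract states.

```python
import math
from typing import List, Sequence


def softmax(weights: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scalar layer weights."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(
    layer_outputs: Sequence[Sequence[float]],
    layer_weights: Sequence[float],
    gamma: float = 1.0,
) -> List[float]:
    """Combine one token's per-layer vectors into a single vector.

    layer_outputs: one hidden vector per layer, all of the same size.
    layer_weights: one learnable scalar per layer (assumed, ELMo-style).
    gamma: a global learnable scale (assumed, ELMo-style).
    """
    if len(layer_outputs) != len(layer_weights):
        raise ValueError("need exactly one scalar weight per layer")
    normalized = softmax(layer_weights)
    mixed = [0.0] * len(layer_outputs[0])
    for s_j, h_j in zip(normalized, layer_outputs):
        for i, value in enumerate(h_j):
            mixed[i] += s_j * value
    return [gamma * value for value in mixed]


if __name__ == "__main__":
    # Three layers, hidden size 2. With equal weights the softmax is
    # uniform, so the mix reduces to a plain average of the layers.
    layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(scalar_mix(layers, [0.0, 0.0, 0.0]))
```

Under this sketch the extra parameters amount to one scalar per layer plus the scale, which is consistent with the abstract's claim of negligible parameter overhead. Because every layer's output feeds the combined vector directly, the loss gradient reaches each layer without passing through the layers above it.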

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Pruning · Linear Layer · Layer Normalization · Dense Connections · Residual Connection · Adam · Multi-Head Attention · Linear Warmup With Linear Decay