TrimBERT: Tailoring BERT for Trade-offs

Sharath Nittur Sridhar; Anthony Sarah; Sairam Sundaresan

arXiv:2202.12411 · cs.CL · February 28, 2022 · 1 citation

Open Access

TL;DR

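TrimBERT is a streamlined version of BERT-Base that reduces the number of intermediate layers, replaces the softmax operations in self-attention with a computationally simpler alternative, and removes half of the layernorm operations. It reports minimal loss in fine-tuning accuracy while reducing model size and training time.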

Contribution

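The paper proposes a BERT-Base variant with fewer intermediate layers, a computationally simpler replacement for softmax in the self-attention layers, and half as many layernorm operations. Together these changes significantly decrease training time and model size with minimal loss in fine-tuning accuracy on downstream tasks.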

Findings

01. Minimal accuracy loss with fewer layers

02. Reduced training time and model size

03. Effective simplification of self-attention operations

Abstract

Models based on BERT have been extremely successful in solving a variety of natural language processing (NLP) tasks. Unfortunately, many of these large models require a great deal of computational resources and/or time for pre-training and fine-tuning, which limits wider adoption. While self-attention layers have been well studied, a strong justification for including the intermediate layers that follow them remains missing in the literature. In this work, we show that reducing the number of intermediate layers in BERT-Base results in minimal fine-tuning accuracy loss on downstream tasks while significantly decreasing model size and training time. We further mitigate two key bottlenecks by replacing all softmax operations in the self-attention layers with a computationally simpler alternative and removing half of all layernorm operations. This further decreases the training time…
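To make the model-size claim concrete, the following is a rough back-of-the-envelope sketch, not the authors' code. It counts the parameters in a BERT-Base-style encoder stack (hidden size 768, intermediate size 3072, 12 layers, which are the standard BERT-Base dimensions) and shows how the count changes when some layers drop their intermediate sublayer. The function names and the `layers_with_intermediate` parameter are hypothetical, the abstract does not say how many intermediate layers TrimBERT keeps, and the sketch assumes that the layernorm attached to a removed sublayer is removed with it.

```python
# Back-of-the-envelope parameter count for a BERT-Base-style encoder stack.
# Dimensions are the standard BERT-Base values; everything else is an
# illustrative assumption, not taken from the TrimBERT paper.
HIDDEN = 768
INTERMEDIATE = 3072
NUM_LAYERS = 12


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out


def attention_params(hidden: int = HIDDEN) -> int:
    """Query, key, value, and output projections of one self-attention sublayer."""
    return 4 * linear_params(hidden, hidden)


def intermediate_params(hidden: int = HIDDEN, inter: int = INTERMEDIATE) -> int:
    """Up-projection and down-projection of one intermediate sublayer."""
    return linear_params(hidden, inter) + linear_params(inter, hidden)


def layernorm_params(hidden: int = HIDDEN) -> int:
    """Scale and shift vectors of one layernorm."""
    return 2 * hidden


def encoder_params(num_layers: int = NUM_LAYERS,
                   layers_with_intermediate: int = NUM_LAYERS) -> int:
    """Total encoder-stack parameters (embeddings and pooler excluded).

    Assumption: a layer without an intermediate sublayer also drops the
    layernorm that would have followed that sublayer.
    """
    total = 0
    for i in range(num_layers):
        total += attention_params() + layernorm_params()
        if i < layers_with_intermediate:
            total += intermediate_params() + layernorm_params()
    return total


if __name__ == "__main__":
    full = encoder_params()
    for kept in (12, 6, 0):  # hypothetical settings, chosen only for illustration
        trimmed = encoder_params(layers_with_intermediate=kept)
        print(f"{kept:2d} intermediate sublayers: {trimmed:,} params "
              f"({trimmed / full:.0%} of full stack)")
```

Under these assumptions the intermediate sublayers account for roughly two thirds of the encoder-stack parameters, which is why trimming them has a large effect on model size. The sketch says nothing about accuracy or training time, both of which the paper measures empirically.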

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification

Methods: Attention Is All You Need · Linear Layer · WordPiece · Residual Connection · Layer Normalization · Dropout · Attention Dropout · Dense Connections · Weight Decay