TrimBERT: Tailoring BERT for Trade-offs
Sharath Nittur Sridhar, Anthony Sarah, Sairam Sundaresan

TL;DR
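TrimBERT is a streamlined variant of BERT-Base that uses fewer intermediate layers, replaces softmax in self-attention with a computationally simpler alternative, and removes half of the layernorm operations. Fine-tuning accuracy stays close to the baseline while model size and training time drop.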
Contribution
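The paper proposes a BERT-Base variant that reduces the number of intermediate layers and replaces softmax in the self-attention layers with a computationally simpler alternative, significantly decreasing training time with minimal loss in fine-tuning accuracy.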
Findings
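Minimal fine-tuning accuracy loss on downstream tasks with fewer intermediate layers
Significantly reduced model size and training time
Softmax in self-attention replaced by a computationally simpler alternative, and half of all layernorm operations removed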
Abstract
Models based on BERT have been extremely successful in solving a variety of natural language processing (NLP) tasks. Unfortunately, many of these large models require a great deal of computational resources and/or time for pre-training and fine-tuning which limits wider adoptability. While self-attention layers have been well-studied, a strong justification for inclusion of the intermediate layers which follow them remains missing in the literature. In this work, we show that reducing the number of intermediate layers in BERT-Base results in minimal fine-tuning accuracy loss of downstream tasks while significantly decreasing model size and training time. We further mitigate two key bottlenecks, by replacing all softmax operations in the self-attention layers with a computationally simpler alternative and removing half of all layernorm operations. This further decreases the training time…
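To give a sense of scale for the model-size claim, the sketch below counts the parameters in one intermediate block of BERT-Base and the total removed when some encoder layers drop theirs. It assumes that "intermediate layer" means the position-wise feed-forward block following self-attention (hidden size 768, intermediate size 3072, 12 encoder layers, the standard BERT-Base configuration). The `kept_blocks` argument is a hypothetical knob for illustration; the abstract does not state how many blocks TrimBERT keeps.

```python
# Standard BERT-Base dimensions.
HIDDEN = 768
INTERMEDIATE = 3072
NUM_LAYERS = 12


def ffn_params(hidden: int = HIDDEN, intermediate: int = INTERMEDIATE) -> int:
    """Weights and biases of one intermediate (feed-forward) block.

    Counts the two dense projections only; the layernorm that follows
    the block is not included.
    """
    up = hidden * intermediate + intermediate    # hidden -> intermediate
    down = intermediate * hidden + hidden        # intermediate -> hidden
    return up + down


def ffn_params_saved(kept_blocks: int, num_layers: int = NUM_LAYERS) -> int:
    """Parameters removed when only `kept_blocks` layers keep their block.

    `kept_blocks` is a hypothetical setting, not a value from the paper.
    """
    if not 0 <= kept_blocks <= num_layers:
        raise ValueError("kept_blocks must be between 0 and num_layers")
    return (num_layers - kept_blocks) * ffn_params(HIDDEN, INTERMEDIATE)


if __name__ == "__main__":
    print(ffn_params())          # parameters in a single block
    print(ffn_params_saved(6))   # saved if half the blocks were dropped
```

Each block holds about 4.7 million parameters, so every block removed shrinks the model by that amount before any change to softmax or layernorm is considered.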
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification
Methods: Attention Is All You Need · Linear Layer · WordPiece · Residual Connection · Layer Normalization · Dropout · Attention Dropout · Dense Connections · Weight Decay
