Learning Dynamic BERT via Trainable Gate Variables and a Bi-modal Regularizer
Seohyeong Jeong, Nojun Kwak

TL;DR
This paper introduces a dynamic inference approach for BERT that applies trainable gate variables to input tokens and trains them with a bi-modal regularizer. It reduces computational cost with a minimal performance drop, which makes the model better suited to resource-limited devices.
Contribution
It proposes a novel dynamic inference method for BERT with trainable gates and a bi-modal regularizer, enabling adjustable trade-offs between accuracy and efficiency.
Findings
Reduced computational cost on GLUE dataset
Minimal performance drop with dynamic inference
Model adjusts performance-cost trade-off via hyperparameter
Abstract
The BERT model has shown significant success on various natural language processing tasks. However, due to the heavy model size and high computational cost, the model suffers from high latency, which is fatal to its deployments on resource-limited devices. To tackle this problem, we propose a dynamic inference method on BERT via trainable gate variables applied on input tokens and a regularizer that has a bi-modal property. Our method shows reduced computational cost on the GLUE dataset with a minimal performance drop. Moreover, the model adjusts with a trade-off between performance and computational cost with the user-specified hyperparameter.
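The abstract names two ingredients, per-token gates and a regularizer with a bi-modal property, without giving their exact form. The sketch below is one plausible reading and is not the authors' formulation. The sigmoid gate, the `g * (1 - g)` penalty, the mean-gate cost term, and every function and parameter name (`lam`, `beta`, `threshold`) are assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Map a real-valued gate logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def token_gates(logits):
    """One soft gate per input token (assumed parameterization)."""
    return [sigmoid(z) for z in logits]


def bimodal_penalty(gates):
    """Assumed bi-modal regularizer: g * (1 - g) is zero at g = 0 and g = 1
    and largest at g = 0.5, so minimizing it pushes gates toward 0 or 1."""
    return sum(g * (1.0 - g) for g in gates) / len(gates)


def cost_penalty(gates):
    """Assumed proxy for compute: the average fraction of tokens kept."""
    return sum(gates) / len(gates)


def gate_loss(task_loss, gates, lam, beta=1.0):
    """Hypothetical training objective. `lam` plays the role of the
    user-specified hyperparameter trading accuracy against cost."""
    return task_loss + lam * cost_penalty(gates) + beta * bimodal_penalty(gates)


def keep_tokens(tokens, gates, threshold=0.5):
    """At inference, drop tokens whose gate falls below the threshold."""
    return [t for t, g in zip(tokens, gates) if g >= threshold]
```

Under these assumptions, a larger `lam` penalizes open gates more heavily, so fewer tokens survive `keep_tokens` and later layers process shorter sequences, at some cost in accuracy.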
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Linear Warmup With Linear Decay · Softmax · Adam · Multi-Head Attention · Attention Dropout · Weight Decay · Residual Connection · Attention Is All You Need · Dropout
