CAMeMBERT: Cascading Assistant-Mediated Multilingual BERT

Dan DeGenaro; Jugal Kalita

arXiv:2212.11456·cs.CL·December 23, 2022

TL;DR

CAMeMBERT applies cascading knowledge distillation to multilingual BERT (mBERT), aiming to reduce time and space requirements while keeping the loss of accuracy below an acceptable threshold.

Contribution

The paper presents CAMeMBERT, a cascading distillation method that repeatedly distills mBERT through increasingly compressed teacher assistant networks, with the goal of improving efficiency while limiting accuracy loss.

Findings

01

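Reports an average accuracy of around 60.1% at present, which the authors note may change with further tuning of the fine-tuning hyperparameters.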

02

Aims to improve on the time and space complexity of the original mBERT.

03

Uses a cascading distillation process with teacher assistant networks.

Abstract

Large language models having hundreds of millions, and even billions, of parameters have performed extremely well on a variety of natural language processing (NLP) tasks. Their widespread use and adoption, however, is hindered by the lack of availability and portability of sufficiently large computational resources. This paper proposes a knowledge distillation (KD) technique building on the work of LightMBERT, a student model of multilingual BERT (mBERT). By repeatedly distilling mBERT through increasingly compressed top-layer distilled teacher assistant networks, CAMeMBERT aims to improve upon the time and space complexities of mBERT while keeping loss of accuracy beneath an acceptable threshold. At present, CAMeMBERT has an average accuracy of around 60.1%, which is subject to change after future improvements to the hyperparameters used in fine-tuning.
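
The cascade described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the step size, and the 6-layer student are hypothetical choices made for illustration, and using a mean squared error over top-layer hidden states is an assumption about what "top-layer distilled" means here. The sketch only shows how a chain of increasingly compressed models pairs up, with each model acting as teacher for the next.

```python
from typing import List, Tuple


def cascade_schedule(teacher_layers: int, student_layers: int, step: int) -> List[int]:
    """Layer counts for the teacher, each teacher assistant, and the final student.

    Hypothetical helper: the step size between models is an assumption.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    if student_layers <= 0 or student_layers > teacher_layers:
        raise ValueError("need 0 < student_layers <= teacher_layers")
    # Shrink by `step` layers per stage, then finish at the student size.
    sizes = list(range(teacher_layers, student_layers, -step))
    sizes.append(student_layers)
    return sizes


def distillation_pairs(sizes: List[int]) -> List[Tuple[int, int]]:
    """Each model in the cascade teaches the next, smaller one."""
    return list(zip(sizes, sizes[1:]))


def top_layer_mse(teacher_top: List[List[float]], student_top: List[List[float]]) -> float:
    """Mean squared error between top-layer hidden states (tokens x hidden size).

    Assumed loss for illustration; both inputs must have the same shape.
    """
    if len(teacher_top) != len(student_top):
        raise ValueError("token counts differ")
    total, count = 0.0, 0
    for t_row, s_row in zip(teacher_top, student_top):
        if len(t_row) != len(s_row):
            raise ValueError("hidden sizes differ")
        for t, s in zip(t_row, s_row):
            total += (t - s) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty hidden states")
    return total / count


# A 12-layer teacher compressed to a hypothetical 6-layer student in steps of 2.
sizes = cascade_schedule(teacher_layers=12, student_layers=6, step=2)
pairs = distillation_pairs(sizes)
```

With these illustrative settings, `sizes` contains the teacher, two teacher assistants, and the student, and `pairs` lists the three distillation stages in order. In a real training loop, each stage would minimize a loss such as `top_layer_mse` between the current teacher's and student's top-layer outputs before the student becomes the teacher for the next stage.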

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification

Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Attention Dropout · Residual Connection · Weight Decay · Dropout · Linear Warmup With Linear Decay · Linear Layer · Adam