On Importance of Layer Pruning for Smaller BERT Models and Low Resource Languages

Mayur Shirke; Amey Shembade; Madhushri Wagh; Pavan Thorat; Raviraj Joshi

arXiv:2501.00733 · cs.CL · January 3, 2025

Open Access

TL;DR

This paper investigates layer pruning in BERT models for low-resource languages, showing that pruned models can match the performance of larger models while reducing complexity and computational costs.

Contribution

The paper demonstrates that layer pruning, especially from the middle of the layer stack, effectively reduces BERT model size without sacrificing accuracy in low-resource language tasks.

Findings

01. Pruned models perform comparably to full models.

02. Middle-layer pruning is most effective.

03. Monolingual BERT outperforms multilingual models.

Abstract

This study explores the effectiveness of layer pruning for developing more efficient BERT models tailored to specific downstream tasks in low-resource languages. Our primary objective is to evaluate whether pruned BERT models can maintain high performance while reducing model size and complexity. We experiment with several BERT variants, including MahaBERT-v2 and Google-Muril, applying different pruning strategies and comparing their performance to smaller, scratch-trained models like MahaBERT-Small and MahaBERT-Smaller. We fine-tune these models on Marathi datasets, specifically Short Headlines Classification (SHC), Long Paragraph Classification (LPC) and Long Document Classification (LDC), to assess their classification accuracy. Our findings demonstrate that pruned models, despite having fewer layers, achieve comparable performance to their fully-layered counterparts while…
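To make the idea of pruning strategies concrete, here is a minimal sketch of how the surviving layer indices might be chosen for a stack of encoder layers. The function names, the strategy labels ("top", "middle", "bottom"), and the rule that "middle" removes a contiguous block centred in the stack are all assumptions made for illustration; the excerpt above does not specify which layers the authors remove.

```python
from typing import List


def layers_to_keep(num_layers: int, num_pruned: int, strategy: str) -> List[int]:
    """Return the indices of encoder layers that survive pruning.

    Assumptions (not taken from the paper):
      - "top" removes the last layers, closest to the output head;
      - "bottom" removes the first layers, closest to the embeddings;
      - "middle" removes a contiguous block centred in the stack.
    """
    if not 0 <= num_pruned < num_layers:
        raise ValueError("num_pruned must be in [0, num_layers)")
    if strategy == "top":
        removed = range(num_layers - num_pruned, num_layers)
    elif strategy == "bottom":
        removed = range(0, num_pruned)
    elif strategy == "middle":
        start = (num_layers - num_pruned) // 2
        removed = range(start, start + num_pruned)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    removed_set = set(removed)
    return [i for i in range(num_layers) if i not in removed_set]


def prune_layers(layers: list, num_pruned: int, strategy: str) -> list:
    """Apply the selection to any list-like stack of layers."""
    keep = layers_to_keep(len(layers), num_pruned, strategy)
    return [layers[i] for i in keep]


if __name__ == "__main__":
    # Hypothetical 12-layer encoder with half of the layers removed.
    for name in ("top", "middle", "bottom"):
        print(name, layers_to_keep(12, 6, name))
```

With a Hugging Face `BertForSequenceClassification` model, the encoder layers are held in `model.bert.encoder.layer`, so one plausible way to apply this would be to replace that attribute with `torch.nn.ModuleList(prune_layers(list(model.bert.encoder.layer), 6, "middle"))`, set `model.config.num_hidden_layers` to match, and then fine-tune the pruned model on the downstream task. This wiring is a sketch and has not been checked against the authors' code.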

Taxonomy

Topics: Distributed and Parallel Computing Systems

Methods: Attention Is All You Need · Layer Normalization · Attention Dropout · Linear Layer · Softmax · Dense Connections · Linear Warmup With Linear Decay · WordPiece · Dropout