BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla

Abhik Bhattacharjee; Tahmid Hasan; Wasi Uddin Ahmad; Kazi Samin; Md Saiful Islam; Anindya Iqbal; M. Sohel Rahman; Rifat Shahriyar

arXiv:2101.00204·cs.CL·May 11, 2022·6 cites

Open Access · 1 Repo · 5 Models · 3 Datasets

TL;DR

BanglaBERT is a pretrained language model for Bangla that achieves state-of-the-art results on multiple NLU tasks, supported by new datasets and benchmarks for this low-resource language.

Contribution

We introduce BanglaBERT, a BERT-based NLU model pretrained on Bangla, along with two new downstream datasets (natural language inference and question answering) and the BLUB benchmark, to advance NLP research in this low-resource language.

Findings

01. BanglaBERT outperforms multilingual and monolingual models on NLU tasks.

02. We created the first comprehensive Bangla NLU benchmark (BLUB).

03. Models, datasets, and leaderboard are publicly available.

Abstract

In this work, we introduce BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed 'Bangla2B+') by crawling 110 popular Bangla sites. We introduce two downstream task datasets on natural language inference and question answering, and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Benchmark (BLUB). BanglaBERT achieves state-of-the-art results, outperforming multilingual and monolingual models. We are making the models, datasets, and a leaderboard publicly available at https://github.com/csebuetnlp/banglabert to advance Bangla NLP.
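
To make the span prediction task concrete: in extractive question answering, a model typically assigns each passage token a start score and an end score, and the answer is the span whose combined score is highest. The sketch below illustrates that decoding step only. It is a generic illustration, not the authors' implementation; the function name `best_span`, the `max_len` cap, and the scores are all hypothetical and made up for this example.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_len: int = 30) -> Tuple[int, int]:
    """Return (start, end) token indices maximizing
    start_scores[s] + end_scores[e], subject to s <= e < s + max_len."""
    if len(start_scores) != len(end_scores) or not start_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider end positions at or after the start, within the cap.
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Hypothetical per-token scores for a 5-token passage (illustrative values).
start = [0.1, 2.0, 0.3, 0.2, 0.1]
end = [0.0, 0.5, 0.1, 1.8, 0.2]
span = best_span(start, end)  # token indices of the selected answer span
```

The length cap is a common practical choice to keep the search small and avoid implausibly long answers; whether and how the released BanglaBERT fine-tuning scripts apply one is not stated in the abstract.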

Code & Models

Repositories

csebuetnlp/banglabert (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications