Vacaspati: A Diverse Corpus of Bangla Literature
Pramit Bhattacharyya, Joydeep Mondal, Subhadip Maji, Arnab Bhattacharya

TL;DR
Vacaspati is a comprehensive Bangla literature corpus that enables improved NLP models, including a lightweight Electra model, which perform better on downstream tasks than models trained on other corpora.
Contribution
The paper introduces Vacaspati, a diverse and large-scale Bangla literature corpus, and develops efficient NLP models trained on it, advancing Bangla language processing.
Findings
Vacaspati contains over 11 million sentences and 115 million words.
Vac-BERT performs better than or comparably to larger models on downstream tasks.
Models trained on Vacaspati outperform those trained on other corpora.
Abstract
Bangla (or Bengali) is the fifth most spoken language globally; yet, the state-of-the-art NLP in Bangla is lagging for even simple tasks such as lemmatization, POS tagging, etc. This is partly due to lack of a varied quality corpus. To alleviate this need, we build Vacaspati, a diverse corpus of Bangla literature. The literary works are collected from various websites; only those works that are publicly available without copyright violations or restrictions are collected. We believe that published literature captures the features of a language much better than newspapers, blogs or social media posts which tend to follow only a certain literary pattern and, therefore, miss out on language variety. Our corpus Vacaspati is varied from multiple aspects, including type of composition, topic, author, time, space, etc. It contains more than 11 million sentences and 115 million words. We also…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · WordPiece · Layer Normalization · Attention Dropout · Weight Decay · Adam · Dense Connections
