Potrika: Raw and Balanced Newspaper Datasets in the Bangla Language with Eight Topics and Five Attributes

Istiak Ahmad; Fahad AlQurashi; Rashid Mehmood

arXiv:2210.09389 · cs.CL · October 19, 2022 · 6 cites

TL;DR

Potrika is the largest Bangla news dataset, offering raw and balanced versions with detailed attributes, enabling extensive NLP research in a low-resource language.

Contribution

This paper introduces Potrika, the first large-scale, multi-attribute Bangla news dataset with both raw and balanced versions for NLP research.

Findings

01

Contains 664,880 articles with 185.51 million words

02

Provides balanced dataset with 320,000 articles across 8 categories

03

Enables diverse NLP applications in the Bangla language
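
As a rough illustration of what a balanced version means in practice, the sketch below downsamples a labeled collection so that every category keeps the same number of articles. If the 320,000 articles are split evenly across the 8 categories, that works out to 40,000 per category. The field name `category`, the function `balance_by_category`, and the toy records are hypothetical and not taken from the paper; the authors' actual balancing procedure is not described in this summary.

```python
import random
from collections import Counter, defaultdict


def balance_by_category(articles, per_category, seed=0):
    """Downsample so that every category keeps exactly `per_category` articles.

    `articles` is a list of dicts with a hypothetical "category" key.
    Raises ValueError if any category has fewer than `per_category` items.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group the articles by their category label.
    by_category = defaultdict(list)
    for article in articles:
        by_category[article["category"]].append(article)

    balanced = []
    for category in sorted(by_category):
        items = by_category[category]
        if len(items) < per_category:
            raise ValueError(
                f"{category!r} has only {len(items)} articles, "
                f"need {per_category}"
            )
        # Draw `per_category` articles without replacement.
        balanced.extend(rng.sample(items, per_category))
    return balanced


# Toy data (made up): 5 "Sports" articles and 3 "Economy" articles.
toy = [{"category": "Sports", "text": f"s{i}"} for i in range(5)]
toy += [{"category": "Economy", "text": f"e{i}"} for i in range(3)]

balanced = balance_by_category(toy, per_category=3)
counts = Counter(article["category"] for article in balanced)

# Even split implied by the headline figures, assuming equal class sizes.
per_category_in_potrika = 320_000 // 8
```

Sampling without replacement keeps every retained article unique; the trade-off is that the smallest category caps how large the balanced set can be.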

Abstract

Knowledge is central to human and scientific developments. Natural Language Processing (NLP) allows automated analysis and creation of knowledge. Data is a crucial NLP and machine learning ingredient. The scarcity of open datasets is a well-known problem in machine and deep learning research. This is very much the case for textual NLP datasets in English and other major world languages. For the Bangla language, the situation is even more challenging and the number of large datasets for NLP research is practically nil. We hereby present Potrika, a large single-label Bangla news article textual dataset curated for NLP research from six popular online news portals in Bangladesh (Jugantor, Jaijaidin, Ittefaq, Kaler Kontho, Inqilab, and Somoyer Alo) for the period 2014-2020. The articles are classified into eight distinct categories (National, Sports, International, Entertainment, Economy,…

Taxonomy

Topics: Imbalanced Data Classification Techniques · Machine Learning and Data Classification · Natural Language Processing Techniques