L3Cube-MahaNews: News-based Short Text and Long Document Classification Datasets in Marathi

Saloni Mittal; Vidula Magdum; Omkar Dhekane; Sharayu Hiwarkhedkar; Raviraj Joshi

arXiv:2404.18216·cs.CL·April 30, 2024

TL;DR

This paper introduces L3Cube-MahaNews, the largest Marathi news classification dataset with 12 categories, and evaluates various BERT models, highlighting MahaBERT's superior performance for Marathi text classification.

Contribution

The work provides the first large-scale Marathi news classification datasets across different document lengths and offers baseline results with state-of-the-art BERT models, including a comparative analysis of monolingual and multilingual variants.

Findings

1. MahaBERT outperforms other models on all datasets.

2. L3Cube-MahaNews is the largest Marathi news classification corpus.

3. Datasets are publicly available for further research.

Abstract

Text and topic classification datasets for the low-resource Marathi language are limited: they typically have fewer than 4 target labels, and on some of them models already achieve nearly perfect accuracy. In this work, we introduce L3Cube-MahaNews, a Marathi text classification corpus that focuses on news headlines and articles. This corpus stands out as the largest supervised Marathi corpus, containing over 1.05 lakh (about 105,000) records classified into a diverse range of 12 categories. To accommodate different document lengths, MahaNews comprises three supervised datasets specifically designed for short text, long documents, and medium paragraphs. The consistent labeling across these datasets facilitates document length-based analysis. We provide detailed data statistics and baseline results on these datasets using state-of-the-art pre-trained BERT models. We conduct a comparative analysis between…

Code & Models

Repositories

l3cube-pune/MarathiNLP (official)

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Linear Warmup With Linear Decay · Weight Decay · Adam · Layer Normalization · Attention Dropout · Multi-Head Attention