TL;DR
This paper introduces L3Cube-MahaNews, the largest Marathi news classification dataset with 12 categories, and evaluates various BERT models, highlighting MahaBERT's superior performance for Marathi text classification.
Contribution
The work provides the first large-scale Marathi news classification datasets across different document lengths and offers baseline results with state-of-the-art BERT models, including a comparative analysis of monolingual and multilingual variants.
Findings
MahaBERT outperforms other models on all datasets.
L3Cube-MahaNews is the largest Marathi news classification corpus.
Datasets are publicly available for further research.
Abstract
Text and topic classification datasets for the low-resource Marathi language are scarce. Those that exist typically have fewer than 4 target labels, and some already yield nearly perfect accuracy. In this work, we introduce L3Cube-MahaNews, a Marathi text classification corpus that focuses on news headlines and articles. It is the largest supervised Marathi corpus, containing over 1.05 lakh (105,000) records classified into 12 diverse categories. To accommodate different document lengths, MahaNews comprises three supervised datasets designed for short text, medium paragraphs, and long documents. The consistent labeling across these datasets facilitates document length-based analysis. We provide detailed data statistics and baseline results on these datasets using state-of-the-art pre-trained BERT models. We conduct a comparative analysis between…
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Linear Warmup With Linear Decay · Weight Decay · Adam · Layer Normalization · Attention Dropout · Multi-Head Attention
