L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages
Aishwarya Mirashi, Srushti Sonavane, Purva Lingayat, Tejas Padhiyar, Raviraj Joshi

TL;DR
This paper introduces L3Cube-IndicNews, a comprehensive multilingual dataset for classifying news headlines and articles in 10 Indian languages, enabling improved NLP models and cross-lingual analysis.
Contribution
It provides a high-quality, multi-length news classification dataset for 10 Indic languages, facilitating research and development of language-specific and cross-lingual models.
Findings
Evaluated with 4 models including monolingual BERT and IndicSBERT
Achieved promising classification performance across languages
Shared datasets and models publicly for community use
Abstract
In this work, we introduce L3Cube-IndicNews, a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages, with a specific focus on news headlines and articles. We have centered our work on 10 prominent Indic languages, including Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, and Punjabi. Each of these news datasets comprises 10 or more classes of news articles. L3Cube-IndicNews offers 3 distinct datasets tailored to handle different document lengths that are classified as: Short Headlines Classification (SHC) dataset containing the news headline and news category, Long Document Classification (LDC) dataset containing the whole news article and the news category, and Long Paragraph Classification (LPC) containing sub-articles of the news and the news category. We maintain consistent labeling across all 3…
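The relationship between the three datasets can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `Record` type, the `build_views` helper, and the blank-line rule for cutting an article into sub-articles are all assumptions made for the example, since this excerpt does not say how the LPC sub-articles were produced. What the sketch does reflect from the abstract is that each view pairs a text of a different length with the same news category.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Record:
    """One classification example: a piece of text and its news category."""
    text: str
    label: str


def build_views(headline: str, article: str, category: str,
                paragraph_sep: str = "\n\n") -> Dict[str, List[Record]]:
    """Derive SHC, LDC, and LPC records from a single news item.

    Assumption: sub-articles are approximated by splitting on blank lines.
    The real dataset may segment articles differently.
    """
    shc = [Record(headline.strip(), category)]           # headline only
    ldc = [Record(article.strip(), category)]            # whole article
    lpc = [Record(part.strip(), category)                # one record per sub-article
           for part in article.split(paragraph_sep) if part.strip()]
    return {"SHC": shc, "LDC": ldc, "LPC": lpc}


views = build_views(
    headline="Team wins final",
    article="Para one.\n\nPara two.",
    category="sports",
)
```

Because every record carries the same label, a model trained on one view can be evaluated on another without remapping categories.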
Code & Models
- 🤗 l3cube-pune/hindi-topic-all-doc · model · 3 dl · ♡ 1
- 🤗 l3cube-pune/bengali-topic-all-doc · model · 4 dl
- 🤗 l3cube-pune/gujarati-topic-all-doc · model · 9 dl
- 🤗 l3cube-pune/kannada-topic-all-doc · model · 6 dl
- 🤗 l3cube-pune/malayalam-topic-all-doc · model · 4 dl
- 🤗 l3cube-pune/odia-topic-all-doc · model · 3 dl · ♡ 1
- 🤗 l3cube-pune/punjabi-topic-all-doc · model · 3 dl
- 🤗 l3cube-pune/tamil-topic-all-doc · model · 2 dl · ♡ 1
- 🤗 l3cube-pune/telugu-topic-all-doc · model · 4 dl
- 🤗 l3cube-pune/marathi-topic-all-doc-v2 · model · 6 dl · ♡ 1
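The model names above follow a regular pattern, which the sketch below turns into a small lookup helper. It rests on two assumptions: that each checkpoint loads as a sequence-classification model through the Hugging Face `transformers` `pipeline("text-classification", ...)` call, and that the label names returned are whatever each model's config defines. The helper names `model_id` and `load_classifier` are hypothetical and exist only for this example.

```python
HUB_ORG = "l3cube-pune"

# Languages with a model in the list above.
LANGUAGES = {
    "hindi", "bengali", "gujarati", "kannada", "malayalam",
    "odia", "punjabi", "tamil", "telugu", "marathi",
}

# Marathi is the only entry whose name carries a "-v2" suffix.
SUFFIX_OVERRIDES = {"marathi": "topic-all-doc-v2"}


def model_id(language: str) -> str:
    """Build the Hub identifier for a listed language."""
    lang = language.strip().lower()
    if lang not in LANGUAGES:
        raise ValueError(f"no listed model for language: {language!r}")
    suffix = SUFFIX_OVERRIDES.get(lang, "topic-all-doc")
    return f"{HUB_ORG}/{lang}-{suffix}"


def load_classifier(language: str):
    """Load a text-classification pipeline for the given language.

    Assumption: the checkpoint is a sequence-classification model.
    The import is deferred so that model_id() works without transformers.
    """
    from transformers import pipeline
    return pipeline("text-classification", model=model_id(language))
```

Calling `load_classifier("hindi")` downloads the checkpoint on first use, so it needs network access and the `transformers` package. `model_id` alone has no dependencies.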
Taxonomy
Topics: Natural Language Processing Techniques · Text and Document Classification Technologies · Topic Modeling
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Attention Dropout · Dense Connections · Weight Decay · WordPiece · Dropout
