NSINA: A News Corpus for Sinhala

Hansi Hettiarachchi; Damith Premasiri; Lasitha Uyangodage; Tharindu Ranasinghe

arXiv:2403.16571 · cs.CL · March 26, 2024 · 1 citation

TL;DR

This paper introduces NSINA, the largest Sinhala news corpus with over 500,000 articles, designed to support NLP tasks and improve language models for Sinhala, a low-resource language.

Contribution

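The paper provides the largest Sinhala news corpus to date, together with benchmarks for three NLP tasks (news media identification, news category prediction, and news headline generation), addressing the scarcity of both training data and benchmarking datasets for Sinhala.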

Findings

01

NSINA is the largest Sinhala news corpus to date.

02

Benchmark results for three NLP tasks on NSINA.

03

Facilitates development of Sinhala language models.

Abstract

The introduction of large language models (LLMs) has advanced natural language processing (NLP), but their effectiveness is largely dependent on pre-training resources. This is especially evident in low-resource languages, such as Sinhala, which face two primary challenges: the lack of substantial training data and limited benchmarking datasets. In response, this study introduces NSINA, a comprehensive news corpus of over 500,000 articles from popular Sinhala news websites, along with three NLP tasks: news media identification, news category prediction, and news headline generation. The release of NSINA aims to provide a solution to challenges in adapting LLMs to Sinhala, offering valuable resources and benchmarks for improving NLP in the Sinhala language. NSINA is the largest Sinhala news corpus available to date.

Taxonomy

Topics: Natural Language Processing Techniques