Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook

Yudhanjaya Wijeratne; Nisansa de Silva

arXiv:2007.07884·cs.CL·July 16, 2020

TL;DR

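This paper introduces two colloquial-language corpora drawn from a decade (2010 to 2020) of posts by Sri Lankan Facebook pages: a larger multilingual corpus and a smaller Sinhala-only subset extracted from it. It also provides a list of algorithmically derived Sinhala stopwords to support linguistic and computational research.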

Contribution

It provides large-scale Sinhala corpora from social media, each marked with date of creation, page of origin, and content type, together with an algorithmically derived list of stopwords, supporting NLP applications in Sinhala language processing.

Findings

01 Large-scale Facebook corpora covering 2010 to 2020, drawn from 533 Sri Lankan pages

02 Algorithmically derived list of Sinhala stopwords

03 Metadata markers on each corpus entry: date of creation, page of origin, and content type
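The listing does not say which algorithm the authors used to derive their stopwords. As a rough illustration of what "algorithmically derived" can mean, the sketch below uses a generic document-frequency heuristic: tokens that occur in a large share of posts are treated as stopword candidates. The function name `candidate_stopwords` and the parameters `min_doc_fraction` and `top_k` are hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter


def candidate_stopwords(documents, min_doc_fraction=0.5, top_k=10):
    """Return up to top_k tokens that occur in at least min_doc_fraction of documents.

    Generic frequency heuristic for illustration only; this is NOT
    the method used in the paper, which is not described here.
    """
    if not documents:
        return []

    term_freq = Counter()  # total occurrences of each token
    doc_freq = Counter()   # number of documents containing each token
    for doc in documents:
        tokens = doc.split()  # naive whitespace tokenization (an assumption)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    n_docs = len(documents)
    common = [t for t in term_freq if doc_freq[t] / n_docs >= min_doc_fraction]
    # Most frequent first; ties broken alphabetically for stable output.
    common.sort(key=lambda t: (-term_freq[t], t))
    return common[:top_k]
```

Calling `candidate_stopwords(posts, min_doc_fraction=0.6)` on a list of post strings returns the tokens that occur in at least 60% of them. A real pipeline would likely need proper Sinhala tokenization and a manually reviewed threshold.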

Abstract

This paper presents two colloquial Sinhala language corpora from the language efforts of the Data, Analysis and Policy team of LIRNEasia, as well as a list of algorithmically derived stopwords. The larger of the two corpora spans 2010 to 2020 and contains 28,825,820 to 29,549,672 words of multilingual text posted by 533 Sri Lankan Facebook pages, including politics, media, celebrities, and other categories; the smaller corpus amounts to 5,402,76 words of only Sinhala text extracted from the larger. Both corpora have markers for their date of creation, page of origin, and content type.
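The abstract says the smaller corpus contains only Sinhala text extracted from the larger multilingual one, but it does not say how the extraction was done. One plausible approach, sketched below, keeps only tokens whose characters all fall in the Unicode Sinhala block (U+0D80 to U+0DFF), allowing the zero-width joiner that Sinhala conjuncts use. The helper names `is_sinhala_token` and `extract_sinhala` are hypothetical and are not the authors' code.

```python
SINHALA_START = 0x0D80  # first code point of the Unicode Sinhala block
SINHALA_END = 0x0DFF    # last code point of the Unicode Sinhala block
ZERO_WIDTH_JOINER = "\u200d"  # appears inside some Sinhala conjuncts


def is_sinhala_token(token):
    """True if the token has at least one Sinhala character and nothing else
    except zero-width joiners."""
    has_sinhala = False
    for ch in token:
        if SINHALA_START <= ord(ch) <= SINHALA_END:
            has_sinhala = True
        elif ch != ZERO_WIDTH_JOINER:
            return False
    return has_sinhala


def extract_sinhala(text):
    """Keep only the whitespace-separated tokens written purely in Sinhala script."""
    return " ".join(t for t in text.split() if is_sinhala_token(t))
```

This filter is deliberately strict: a Sinhala word with attached Latin punctuation, such as a trailing comma, is dropped. Stripping punctuation before filtering would be a reasonable refinement.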

Code & Models

Repositories

LIRNEasia/FacebookDecadeCorpora
Official