CebuaNER: A New Baseline Cebuano Named Entity Recognition Model

Ma. Beatrice Emanuela Pilar; Ellyza Mari Papas; Mary Loise Buenaventura; Dane Dedoroy; Myron Darrel Montefalcon; Jay Rhald Padilla; Lany Maceda; Mideth Abisado; Joseph Marvin Imperial

arXiv:2310.00679·cs.CL·October 3, 2023

TL;DR

This paper introduces CebuaNER, a baseline Cebuano NER model trained on a large annotated news corpus. It achieves over 70% precision, recall, and F1 across entity tags and shows promise for crosslingual use with Tagalog.

Contribution

It presents the first large-scale annotated Cebuano news corpus and a baseline NER model, advancing NLP resources for underrepresented Southeast Asian languages.

Findings

01  Achieved over 70% precision, recall, and F1 scores across entity tags.

02  Collected and annotated over 4,000 Cebuano news articles.

03  Demonstrated potential for crosslingual NER with Tagalog.
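
To make the per-tag metrics in the first finding concrete, here is a minimal sketch of how precision, recall, and F1 can be computed for each entity tag. It assumes token-level scoring, which may differ from the paper's actual evaluation protocol (not described on this page). The function name `per_tag_scores` and the toy tags (`PER`, `ORG`, `LOC`, `O`) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def per_tag_scores(gold, pred):
    """Token-level precision, recall, and F1 for each tag.

    gold, pred: equal-length lists of tag strings, one per token.
    This is an illustrative sketch, not the paper's evaluation code.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # tag p was predicted where it is not the gold tag
            fn[g] += 1  # gold tag g was missed

    scores = {}
    for tag in sorted(set(gold) | set(pred)):
        predicted = tp[tag] + fp[tag]
        actual = tp[tag] + fn[tag]
        precision = tp[tag] / predicted if predicted else 0.0
        recall = tp[tag] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[tag] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Hypothetical toy sequence of six tokens (not data from the paper).
gold = ["PER", "PER", "O", "LOC", "O", "ORG"]
pred = ["PER", "O", "O", "LOC", "LOC", "ORG"]
scores = per_tag_scores(gold, pred)

for tag, s in scores.items():
    print(tag, round(s["precision"], 2), round(s["recall"], 2), round(s["f1"], 2))
```

Reporting the three numbers separately for each tag, rather than a single pooled accuracy, keeps a weak tag from being hidden by a frequent one such as `O`.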

Abstract

Although Southeast Asia is one of the most linguistically diverse groups of countries, its computational linguistics and language processing research has struggled to match the level of research in countries from the Global North. Thus, initiatives such as open-sourcing corpora and developing baseline models for basic language processing tasks are important stepping stones to encourage the growth of research efforts in the field. To answer this call, we introduce CebuaNER, a new baseline model for named entity recognition (NER) in the Cebuano language. Cebuano is the second most-used native language in the Philippines, with over 20 million speakers. To build the model, we collected and annotated over 4,000 news articles, the largest of any work in the language, retrieved from online local Cebuano platforms to train algorithms such as Conditional Random Field and Bidirectional LSTM. Our…

Code & Models

Repositories

mebzmoren/cebuaner (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory