An Open-Source Dataset and A Multi-Task Model for Malay Named Entity Recognition
Yingwen Fu, Nankai Lin, Zhihe Yang, Shengyi Jiang

TL;DR
This paper introduces a new Malay NER dataset and a multi-task model with boundary detection and a gated ignoring mechanism, achieving competitive results and providing valuable resources for low-resource language NLP.
Contribution
The work presents a novel dataset construction framework for Malay NER and a multi-task model with boundary detection and error mitigation mechanisms.
Findings
The dataset MYNER contains 28,991 sentences and over 384,000 tokens.
The proposed model achieves results comparable to baseline methods on MYNER.
The dataset and model are publicly released as benchmarks.
Abstract
Named entity recognition (NER) is a fundamental task in natural language processing (NLP). However, most state-of-the-art research is oriented toward high-resource languages such as English and has not been widely applied to low-resource languages. For the Malay language, relevant NER resources are limited. In this work, we propose a dataset construction framework, based on labeled datasets of homologous languages and iterative optimization, to build a Malay NER dataset (MYNER) comprising 28,991 sentences (over 384 thousand tokens). Additionally, to better integrate boundary information for NER, we propose a multi-task (MT) model with a bidirectional revision (Bi-revision) mechanism for the Malay NER task. Specifically, an auxiliary task, boundary detection, is introduced to improve NER training in both explicit and implicit ways. Furthermore, a gated ignoring mechanism is proposed…
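To illustrate what a boundary-detection auxiliary task could train on, here is a minimal sketch that derives untyped boundary labels from typed BIO tags. The abstract does not specify the paper's boundary tagging scheme, so the B/I/E/S/O scheme, the function name `bio_to_boundary`, and the entity types in the example are assumptions made for illustration only.

```python
from typing import List


def bio_to_boundary(tags: List[str]) -> List[str]:
    """Map typed BIO tags (e.g. "B-ORG") to untyped boundary tags.

    Assumed scheme (not taken from the paper):
      B = first token of a multi-token entity
      I = inside token, E = last token
      S = single-token entity, O = outside any entity
    """
    boundary = []
    for i, tag in enumerate(tags):
        if tag == "O":
            boundary.append("O")
            continue
        prefix, entity_type = tag.split("-", 1)
        # The entity continues only if the next tag is I- with the same type.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + entity_type
        if prefix == "B":
            boundary.append("B" if continues else "S")
        else:
            boundary.append("I" if continues else "E")
    return boundary


# Illustrative tag sequence; the entity types are hypothetical.
ner_tags = ["B-ORG", "I-ORG", "O", "O", "B-LOC", "I-LOC"]
boundary_tags = bio_to_boundary(ner_tags)
print(boundary_tags)
```

In a multi-task setup of this general kind, the NER head would be trained on `ner_tags` and the auxiliary head on `boundary_tags`, with the two losses combined; how the paper weights or couples them is not stated in the abstract.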
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
