Toward a Standardized and More Accurate Indonesian Part-of-Speech Tagging

Kemal Kurniawan; Alham Fikri Aji

arXiv:1809.03391 · cs.CL · February 27, 2019

TL;DR

This paper evaluates rule-based, CRF, and neural network techniques for Indonesian POS tagging, achieves a new state of the art with a recurrent neural network, and provides a standardized dataset split for future research.

Contribution

It introduces a standardized dataset split for Indonesian POS tagging and demonstrates the effectiveness of neural network models, setting a new performance benchmark.

Findings

1. A recurrent neural network achieved an F1 score of 97.47.
2. Neural models outperform rule-based and CRF approaches.
3. A standardized dataset split is released for future research.

Abstract

Previous work on Indonesian part-of-speech (POS) tagging is hard to compare because it was not evaluated on a common dataset. Furthermore, in spite of the success of neural network models for English POS tagging, they are rarely explored for Indonesian. In this paper, we explored various techniques for Indonesian POS tagging, including rule-based, CRF, and neural network-based models. We evaluated our models on the IDN Tagged Corpus. A new state of the art of 97.47 F1 score is achieved with a recurrent neural network. To provide a standard for future work, we publicly release the dataset split that we used.
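To illustrate why a published split makes results comparable, here is a minimal sketch of a reproducible partition of a tagged corpus. It is not the authors' procedure: the function name `split_indices`, the seed, and the 80/10/10 proportions are all assumptions chosen for illustration, and the split actually released with the paper may be organized differently.

```python
import random
from typing import List, Tuple


def split_indices(
    n_items: int,
    seed: int = 0,
    dev_frac: float = 0.1,
    test_frac: float = 0.1,
) -> Tuple[List[int], List[int], List[int]]:
    """Partition sentence indices 0..n_items-1 into train/dev/test.

    Hypothetical helper: the fractions and seed are illustrative
    assumptions, not values taken from the paper.
    """
    indices = list(range(n_items))
    # A private Random instance keeps the shuffle independent of
    # any other use of the global random state.
    random.Random(seed).shuffle(indices)

    n_test = int(n_items * test_frac)
    n_dev = int(n_items * dev_frac)

    test = sorted(indices[:n_test])
    dev = sorted(indices[n_test:n_test + n_dev])
    train = sorted(indices[n_test + n_dev:])
    return train, dev, test


if __name__ == "__main__":
    train, dev, test = split_indices(100, seed=42)
    print(len(train), len(dev), len(test))
```

Publishing the resulting index lists (or the files built from them), rather than only the seed, is the safer choice, since a shuffle can differ across library versions while a stored list of indices cannot.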

Code & Models

Repositories

kmkurn/id-pos-tagging (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Conditional Random Field