IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP

Fajri Koto; Afshin Rahimi; Jey Han Lau; Timothy Baldwin

arXiv:2011.00677 · cs.CL · November 3, 2020 · 32 cites

TL;DR

This paper introduces IndoLEM, an Indonesian NLP benchmark dataset comprising seven tasks, and IndoBERT, a new pre-trained language model for Indonesian that achieves state-of-the-art results on most IndoLEM tasks.

Contribution

It provides an Indonesian NLP benchmark dataset covering seven tasks and a dedicated pre-trained language model, addressing the lack of annotated datasets, the sparsity of language resources, and the lack of resource standardization.

Findings

01

IndoBERT outperforms existing models on most tasks.

02

IndoLEM covers seven Indonesian NLP tasks spanning morpho-syntax, semantics, and discourse.

03

IndoBERT achieves state-of-the-art performance over most of the tasks in IndoLEM.

Abstract

Although the Indonesian language is spoken by almost 200 million people and is the 10th most spoken language in the world, it is under-represented in NLP research. Previous work on Indonesian has been hampered by a lack of annotated datasets, a sparsity of language resources, and a lack of resource standardization. In this work, we release the IndoLEM dataset comprising seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse. We additionally release IndoBERT, a new pre-trained language model for Indonesian, and evaluate it over IndoLEM, in addition to benchmarking it against existing resources. Our experiments show that IndoBERT achieves state-of-the-art performance over most of the tasks in IndoLEM.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Sentiment Analysis and Opinion Mining