IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding
Bryan Wilie, Karissa Vincentio, Genta Indra Winata, Samuel Cahyawijaya, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, Ayu Purwarianti

TL;DR
This paper introduces IndoNLU, a comprehensive benchmark and resource suite for Indonesian natural language understanding, including diverse datasets, pre-trained models, and evaluation frameworks to accelerate NLP research for Indonesian.
Contribution
It provides the first extensive set of datasets, pre-trained models, and benchmarking tools specifically for Indonesian NLP tasks, filling a critical resource gap.
Findings
Baseline models established for all twelve tasks.
Diverse datasets from multiple domains and styles.
Pre-trained IndoBERT models trained on a large Indonesian corpus (Indo4B).
Abstract
Although Indonesian is known to be the fourth most frequently used language over the internet, the research progress on this language in the natural language processing (NLP) is slow-moving due to a lack of available resources. In response, we introduce the first-ever vast resource for the training, evaluating, and benchmarking on Indonesian natural language understanding (IndoNLU) tasks. IndoNLU includes twelve tasks, ranging from single sentence classification to pair-sentences sequence labeling with different levels of complexity. The datasets for the tasks lie in different domains and styles to ensure task diversity. We also provide a set of Indonesian pre-trained models (IndoBERT) trained from a large and clean Indonesian dataset Indo4B collected from publicly available sources such as social media texts, blogs, news, and websites. We release baseline models for all twelve tasks,…
Code & Models
- 🤗 indobenchmark/indobert-base-p1 · model · 208k dl · ♡ 45
- 🤗 indobenchmark/indobert-base-p2 · model · 17k dl · ♡ 7
- 🤗 indobenchmark/indobert-large-p1 · model · 967 dl · ♡ 4
- 🤗 indobenchmark/indobert-large-p2 · model · 1.2k dl · ♡ 9
- 🤗 indobenchmark/indobert-lite-base-p1 · model · 708 dl
- 🤗 indobenchmark/indobert-lite-base-p2 · model · 655 dl
- 🤗 indobenchmark/indobert-lite-large-p1 · model · 38 dl
- 🤗 indobenchmark/indobert-lite-large-p2 · model · 56 dl · ♡ 1
- 🤗 tyqiangz/indobert-lite-large-p2-smsa · model · 223 dl · ♡ 1
- 🤗 derhan/indobert-sbu-paket · model · 1 dl
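The eight indobenchmark checkpoints listed here follow a regular naming pattern (family, size, and a p1/p2 suffix whose meaning this page does not state). The sketch below enumerates those IDs and shows one plausible way to load a checkpoint with the Hugging Face transformers library. The helper names indobert_checkpoints and load_checkpoint are hypothetical, and the choice of BertTokenizer is an assumption to verify against each model card; loading also requires transformers, torch, and network access.

```python
from itertools import product

ORG = "indobenchmark"  # Hub organization shown in the model list


def indobert_checkpoints():
    """Enumerate the eight indobenchmark model IDs from their naming pattern."""
    families = ("indobert", "indobert-lite")
    sizes = ("base", "large")
    variants = ("p1", "p2")  # suffixes as listed; their meaning is not stated on this page
    return [
        f"{ORG}/{family}-{size}-{variant}"
        for family, size, variant in product(families, sizes, variants)
    ]


def load_checkpoint(model_id):
    """Load a tokenizer and encoder for one checkpoint (hypothetical helper).

    Assumption: the checkpoints ship a BERT-style vocabulary, so BertTokenizer
    is used here; check the model card before relying on this.
    """
    # Imported lazily so the enumeration above works without transformers installed.
    from transformers import AutoModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


if __name__ == "__main__":
    for model_id in indobert_checkpoints():
        print(model_id)
```

Calling load_checkpoint("indobenchmark/indobert-base-p1") would download the most-downloaded checkpoint in the list; the two community fine-tunes (tyqiangz, derhan) are not covered by the enumeration and would need their IDs passed explicitly.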
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Educational Technology Systems
