IndoNLI: A Natural Language Inference Dataset for Indonesian

Rahmad Mahendra, Alham Fikri Aji, Samuel Louvan, Fahrurrozi Rahman, and Clara Vania

arXiv:2110.14566·cs.CL·March 30, 2022

TL;DR

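IndoNLI is the first human-elicited natural language inference dataset for Indonesian. It contains nearly 18K sentence pairs annotated by crowd workers and experts, and its expert-annotated test set deliberately covers linguistic phenomena such as numerical reasoning, structural changes, idioms, and temporal and spatial reasoning.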

Contribution

This paper introduces IndoNLI, the first human-elicited NLI dataset for Indonesian, with expert annotations and diverse linguistic phenomena, filling a critical gap in Indonesian NLP resources.

Findings

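01. XLM-R outperforms the other pre-trained models evaluated on IndoNLI.

02. The best model still falls well short of human performance on the expert-annotated test set (13.4% accuracy gap).

03. The expert-annotated data is more diverse and contains fewer annotation artifacts than the crowd-annotated data.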

Abstract

We present IndoNLI, the first human-elicited NLI dataset for Indonesian. We adapt the data collection protocol for MNLI and collect nearly 18K sentence pairs annotated by crowd workers and experts. The expert-annotated data is used exclusively as a test set. It is designed to provide a challenging test-bed for Indonesian NLI by explicitly incorporating various linguistic phenomena such as numerical reasoning, structural changes, idioms, or temporal and spatial reasoning. Experiment results show that XLM-R outperforms other pre-trained models in our data. The best performance on the expert-annotated data is still far below human performance (13.4% accuracy gap), suggesting that this test set is especially challenging. Furthermore, our analysis shows that our expert-annotated data is more diverse and contains fewer annotation artifacts than the crowd-annotated data. We hope this dataset…
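To make the task and the reported accuracy gap concrete, here is a minimal sketch of how NLI pairs and the human-versus-model comparison could be represented. The three-way label set (entailment, contradiction, neutral) is assumed from the MNLI protocol the abstract says was adapted. The class `NLIExample`, the helper functions, the Indonesian sentences, and the predictions are all hypothetical illustrations, not items or results from IndoNLI. The gap is computed in percentage points, which is presumably how the 13.4% figure should be read.

```python
from dataclasses import dataclass

# Assumed label set, following the three-way MNLI scheme.
LABELS = ("entailment", "contradiction", "neutral")


@dataclass(frozen=True)
class NLIExample:
    """One premise/hypothesis pair with its gold label (hypothetical schema)."""
    premise: str
    hypothesis: str
    label: str

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def accuracy(gold, predicted):
    """Fraction of predictions that match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def accuracy_gap(human_acc, model_acc):
    """Difference between human and model accuracy, in percentage points."""
    return (human_acc - model_acc) * 100


# Invented examples for illustration only; they are NOT taken from IndoNLI.
examples = [
    NLIExample("Budi membeli tiga buku di toko.", "Budi membeli buku.", "entailment"),
    NLIExample("Budi membeli tiga buku di toko.", "Budi tidak membeli apa pun.", "contradiction"),
    NLIExample("Budi membeli tiga buku di toko.", "Budi membeli buku itu untuk adiknya.", "neutral"),
]

gold = [ex.label for ex in examples]
model_preds = ["entailment", "neutral", "neutral"]  # made-up model output
human_preds = list(gold)                            # made-up human output

model_acc = accuracy(gold, model_preds)
human_acc = accuracy(gold, human_preds)
print(f"model accuracy: {model_acc:.3f}")
print(f"human accuracy: {human_acc:.3f}")
print(f"gap (percentage points): {accuracy_gap(human_acc, model_acc):.1f}")
```

The first example also hints at why numerical reasoning is singled out as a phenomenon: the hypothesis is entailed only if the model recognizes that buying three books implies buying books at all.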

Code & Models

Repositories

ir-nlp-csui/indonli (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies

Methods: Test · XLM-R