Improving Indonesian Text Classification Using Multilingual Language Model

Ilham Firdausi Putra (1), Ayu Purwarianti (1, 2) ((1) Institut Teknologi Bandung, (2) U-CoE AI-VLB)

arXiv:2009.05713·cs.CL·September 15, 2020

TL;DR

This paper explores how multilingual language models can enhance Indonesian text classification by combining English and Indonesian data, showing improved performance especially with limited Indonesian data.

Contribution

It demonstrates the effectiveness of using multilingual models with English data to improve Indonesian text classification, both through feature-based and fine-tuning approaches.

Findings

01  Adding English data improves Indonesian classification accuracy
02  Multilingual models outperform monolingual models in low-resource settings
03  Fine-tuning with English data enhances model performance

Abstract

Compared to English, the amount of labeled data for Indonesian text classification tasks is very small. Recently developed multilingual language models have shown their ability to create multilingual representations effectively. This paper investigates the effect of combining English and Indonesian data when building Indonesian text classification models (e.g., for sentiment analysis and hate speech detection) using multilingual language models. Using the feature-based approach, we observe its performance across various Indonesian data sizes and amounts of added English data. The experiments showed that adding English data improves performance, especially when the amount of Indonesian data is small. Using the fine-tuning approach, we further showed its effectiveness in utilizing English data to build Indonesian text classification models.
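
The data-mixing setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the helper name `mix_training_data`, the `english_multiplier` parameter, and the toy sentences are all assumptions made for this example. It keeps every Indonesian example and adds a random sample of English examples whose size is a multiple of the Indonesian set, which is one plausible way to vary the amount of added English data.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, int]  # (text, label)


def mix_training_data(
    indonesian: Sequence[Example],
    english: Sequence[Example],
    english_multiplier: float,
    seed: int = 0,
) -> List[Example]:
    """Return all Indonesian examples plus a random English sample.

    Hypothetical helper: the English sample size is
    english_multiplier * len(indonesian), capped at the number of
    English examples available.
    """
    if english_multiplier < 0:
        raise ValueError("english_multiplier must be non-negative")
    rng = random.Random(seed)
    n_english = min(len(english), int(round(english_multiplier * len(indonesian))))
    mixed = list(indonesian) + rng.sample(list(english), n_english)
    rng.shuffle(mixed)  # avoid ordering the training set by language
    return mixed


# Toy data, invented for illustration only.
indonesian = [("filmnya bagus sekali", 1), ("pelayanannya buruk", 0)]
english = [
    ("great movie", 1),
    ("terrible service", 0),
    ("loved it", 1),
    ("would not recommend", 0),
]

# Sweep over the amount of added English data.
for multiplier in (0, 1, 2):
    train = mix_training_data(indonesian, english, multiplier)
    print(multiplier, len(train))
```

In a feature-based setting, each mixed training text would then be embedded by a multilingual encoder and a classifier trained on those features, while evaluation stays on Indonesian data only.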

Code & Models

Repositories

ilhamfp/indonesian-text-classification-multilingual
PyTorch, official
