KaWAT: A Word Analogy Task Dataset for Indonesian

Kemal Kurniawan

arXiv:1906.09912·cs.CL·June 25, 2019·1 cites

TL;DR

KaWAT is a new Indonesian word analogy dataset for evaluating Indonesian word embeddings. The paper also shows that pretrained embeddings help on two downstream tasks.

Contribution

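The paper introduces KaWAT, a new word analogy task dataset for Indonesian, and uses it to evaluate several existing pretrained Indonesian word embeddings alongside embeddings trained on an Indonesian online news corpus. The same embeddings are also tested on two downstream tasks.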

Findings

01 Pretrained embeddings help on downstream tasks, either by reducing the number of training epochs or by yielding significant performance gains.

02 Embeddings trained on a news corpus outperform others.

03 KaWAT provides a benchmark for evaluating Indonesian word embeddings.

Abstract

We introduced KaWAT (Kata Word Analogy Task), a new word analogy task dataset for Indonesian. We evaluated several existing pretrained Indonesian word embeddings on it, as well as embeddings trained on an Indonesian online news corpus. We also tested them on two downstream tasks and found that pretrained word embeddings helped either by reducing the number of training epochs or by yielding significant performance gains.
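
To make the evaluation setup concrete, here is a minimal sketch of how a word analogy question ("a is to b as c is to ?") is commonly scored against a set of embeddings. It assumes the vector-offset method (often called 3CosAdd); the abstract does not state which scoring method the paper uses, so treat this as an illustration, not the paper's procedure. The function names, the toy 3-dimensional vectors, and the sample questions are all hypothetical and are not taken from KaWAT.

```python
import numpy as np


def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset (3CosAdd) method.

    emb maps each word to a 1-D NumPy vector. The query words a, b, c are
    excluded from the candidates, as is conventional for this task.
    """
    candidates = [w for w in emb if w not in (a, b, c)]
    matrix = np.stack([emb[w] for w in candidates])
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

    unit = {w: emb[w] / np.linalg.norm(emb[w]) for w in (a, b, c)}
    target = unit[b] - unit[a] + unit[c]
    target = target / np.linalg.norm(target)

    # Rows of `matrix` and `target` are unit length, so the dot product
    # is the cosine similarity.
    return candidates[int(np.argmax(matrix @ target))]


def analogy_accuracy(emb, questions):
    """Fraction of questions answered correctly.

    Assumption: questions containing an out-of-vocabulary word are skipped
    instead of being counted as wrong.
    """
    usable = [q for q in questions if all(w in emb for w in q)]
    if not usable:
        return 0.0
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in usable)
    return correct / len(usable)


# Hypothetical toy embeddings (hand-made, not trained vectors).
toy_emb = {
    "pria": np.array([1.0, 0.0, 0.0]),    # man
    "wanita": np.array([1.0, 1.0, 0.0]),  # woman
    "raja": np.array([1.0, 0.0, 1.0]),    # king
    "ratu": np.array([1.0, 1.0, 1.0]),    # queen
    "apel": np.array([-1.0, 0.2, 0.1]),   # apple (distractor)
}

# Hypothetical questions in (a, b, c, expected) form.
toy_questions = [
    ("pria", "wanita", "raja", "ratu"),
    ("raja", "ratu", "pria", "wanita"),
    ("pria", "wanita", "paman", "bibi"),  # skipped: words missing from toy_emb
]

print(solve_analogy(toy_emb, "pria", "wanita", "raja"))
print(analogy_accuracy(toy_emb, toy_questions))
```

Whether out-of-vocabulary questions are skipped or counted as errors changes the reported accuracy, so the choice should be stated whenever scores from different embeddings are compared.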

Code & Models

Datasets

SEACrowd/kawat

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Sentiment Analysis and Opinion Mining