A Survey on Awesome Korean NLP Datasets

Byunghyun Ban

arXiv:2112.01624 · cs.CL · November 29, 2022

TL;DR

This survey reviews 15 key Korean NLP datasets, providing detailed summaries and practical guidance to support researchers developing Korean language processing technologies.

Contribution

It offers a comprehensive overview of Korean NLP datasets, including detailed descriptions, statistics, and practical instructions, facilitating research and development in Korean NLP.

Findings

01. 15 Korean NLP datasets summarized with key details
02. High-resolution instructions and dataset statistics provided
03. A single table summarizes main characteristics of datasets

Abstract

English-based datasets are commonly available from Kaggle, GitHub, or recently published papers. Although benchmark tests on English datasets are sufficient to demonstrate the performance of new models and methods, researchers still need to train and validate models on Korean-based datasets to produce a technology or product suitable for Korean language processing. This paper introduces 15 popular Korean-based NLP datasets with summarized details such as volume, license, repositories, and other research results inspired by the datasets. I also provide high-resolution instructions with samples or statistics of the datasets. The main characteristics of the datasets are presented in a single table to give researchers a rapid summary.

Taxonomy

Topics: Topic Modeling