A Survey on Awesome Korean NLP Datasets
Byunghyun Ban

TL;DR
This survey reviews 15 key Korean NLP datasets, providing detailed summaries and practical guidance to support researchers developing Korean language processing technologies.
Contribution
It offers a comprehensive overview of Korean NLP datasets, including detailed descriptions, statistics, and practical instructions, facilitating research and development in Korean NLP.
Findings
15 Korean NLP datasets summarized with key details
Detailed instructions and dataset statistics provided
A single table summarizes main characteristics of datasets
Abstract
English-based datasets are commonly available from Kaggle, GitHub, or recently published papers. Although benchmark tests on English datasets are sufficient to demonstrate the performance of new models and methods, a researcher still needs to train and validate models on Korean-based datasets to produce a technology or product suitable for Korean language processing. This paper introduces 15 popular Korean-based NLP datasets with summarized details such as volume, license, repositories, and other research results inspired by the datasets. I also provide detailed instructions with samples or statistics of the datasets. The main characteristics of the datasets are presented in a single table to give researchers a rapid summary.
Taxonomy
Topics: Topic Modeling
