KorQuAD1.0: Korean QA Dataset for Machine Reading Comprehension

Seungyoung Lim; Myungji Kim; Jooyoul Lee

arXiv:1909.07005 · cs.CL · September 18, 2019 · 37 cites

TL;DR

KorQuAD1.0 is a large-scale Korean dataset with over 70,000 question-answer pairs from Wikipedia, designed to advance machine reading comprehension and support multilingual NLP research.

Contribution

It introduces the first large-scale Korean MRC dataset, enabling research in Korean language understanding and multilingual NLP tasks.

Findings

01. Provides a comprehensive Korean QA dataset with 70,000+ pairs

02. Facilitates development of Korean MRC models and benchmarks

03. Encourages multilingual NLP research through a public challenge

Abstract

Machine Reading Comprehension (MRC) is a task that requires a machine to understand natural language and answer questions by reading a document. It is at the core of automatic response technologies such as chatbots and automated customer support systems. We present the Korean Question Answering Dataset (KorQuAD), a large-scale Korean dataset for the extractive machine reading comprehension task. It consists of 70,000+ human-generated question-answer pairs on Korean Wikipedia articles. We release KorQuAD1.0 and launch a challenge at https://KorQuAD.github.io to encourage the development of multilingual natural language processing research.
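To make "extractive" concrete: in an extractive MRC task the answer is a contiguous span of the document, so a record can be described by the document text, the question, the answer text, and where the answer starts. The sketch below is illustrative only. The `ExtractiveExample` class, its field names, the helper functions, and the toy Korean sentence are all assumptions made for this example; they are not taken from the KorQuAD1.0 release or its official evaluation script.

```python
from dataclasses import dataclass


@dataclass
class ExtractiveExample:
    """Hypothetical record layout for one extractive QA pair (not the official schema)."""
    context: str       # the passage the question is asked about
    question: str      # the human-written question
    answer_text: str   # the answer, which must appear verbatim in the context
    answer_start: int  # character offset of the answer inside the context


def answer_span(example: ExtractiveExample) -> tuple[int, int]:
    """Return the (start, end) character offsets of the answer, end exclusive."""
    return example.answer_start, example.answer_start + len(example.answer_text)


def is_consistent(example: ExtractiveExample) -> bool:
    """Check that slicing the context at the stated offsets yields the answer text."""
    start, end = answer_span(example)
    return example.context[start:end] == example.answer_text


def exact_match(prediction: str, example: ExtractiveExample) -> bool:
    """Simplified exact-match check: compares strings after trimming whitespace."""
    return prediction.strip() == example.answer_text.strip()


# Toy example written for illustration; it is not a record from the dataset.
toy = ExtractiveExample(
    context="서울은 대한민국의 수도이다.",
    question="대한민국의 수도는 어디인가?",
    answer_text="서울",
    answer_start=0,
)
```

A model for this task would predict a span in `toy.context`, and a check such as `exact_match` compares that predicted span with the reference answer. Real evaluation scripts usually apply more normalization than the whitespace trimming shown here.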

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications