# Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension

**Authors:** Kai Sun, Dian Yu, Dong Yu, Claire Cardie

arXiv: 1904.09679 · 2019-12-18

## TL;DR

This paper introduces the C^3 dataset for Chinese machine reading comprehension, highlighting the significant challenge of questions requiring diverse prior knowledge and analyzing the performance gap between models and humans.

## Contribution

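It presents C^3, the first free-form multiple-choice Chinese MRC dataset, built from real-world questions in Chinese-as-a-second-language examinations, and provides a comprehensive analysis of the prior knowledge (linguistic, domain-specific, and general world knowledge) these questions require.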

## Key findings

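- The best-performing model reaches 68.5% accuracy, well below human readers at 96.0%.
- The gap is widest on problems that require prior knowledge; answering 86.8% of questions requires knowledge both within and beyond the accompanying document.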
- Data augmentation improves accuracy on knowledge-intensive questions.
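
The figures above are accuracies over multiple-choice questions. As a minimal sketch of how such a score and the model-human gap could be computed, the Python below uses a hypothetical `Question` record and toy English examples. The field names, the helper functions, and the sample data are assumptions for illustration only; they are not the released C^3 format or the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Question:
    """Hypothetical record for one multiple-choice question (not the C^3 schema)."""
    document: str        # dialogue or written passage the question refers to
    question: str        # free-form question text
    options: List[str]   # candidate answers, exactly one of which is correct
    answer_index: int    # position of the correct option in `options`


def accuracy(questions: List[Question], predictions: List[int]) -> float:
    """Fraction of questions whose predicted option index matches the gold index."""
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")
    if not questions:
        return 0.0
    correct = sum(1 for q, p in zip(questions, predictions) if p == q.answer_index)
    return correct / len(questions)


def gap_in_points(model_acc: float, human_acc: float) -> float:
    """Difference between human and model accuracy, in percentage points."""
    return round((human_acc - model_acc) * 100, 1)


# Toy data, invented for illustration only.
toy_questions = [
    Question("A: Shall we take the bus? B: It is raining, let's take a taxi.",
             "How will they travel?", ["By bus", "By taxi", "On foot"], 1),
    Question("The shop opens at nine and closes at six.",
             "When does the shop close?", ["Nine", "Six", "Noon"], 1),
    Question("He forgot his umbrella at the office.",
             "Where is the umbrella?", ["At home", "At the office"], 1),
    Question("She has studied Chinese for three years.",
             "What does she study?", ["Chinese", "French", "Maths"], 0),
]
toy_predictions = [1, 1, 0, 0]  # the third prediction is wrong

toy_accuracy = accuracy(toy_questions, toy_predictions)
reported_gap = gap_in_points(0.685, 0.960)  # accuracies reported in the abstract
```

With the accuracies reported in the abstract, `gap_in_points(0.685, 0.960)` gives a gap of 27.5 percentage points between the best model and human readers.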

## Abstract

Machine reading comprehension tasks require a machine reader to answer questions relevant to the given document. In this paper, we present the first free-form multiple-Choice Chinese machine reading Comprehension dataset (C^3), containing 13,369 documents (dialogues or more formally written mixed-genre texts) and their associated 19,577 multiple-choice free-form questions collected from Chinese-as-a-second-language examinations. We present a comprehensive analysis of the prior knowledge (i.e., linguistic, domain-specific, and general world knowledge) needed for these real-world problems. We implement rule-based and popular neural methods and find that there is still a significant performance gap between the best performing model (68.5%) and human readers (96.0%), especially on problems that require prior knowledge. We further study the effects of distractor plausibility and data augmentation based on translated relevant datasets for English on model performance. We expect C^3 to present great challenges to existing systems as answering 86.8% of questions requires both knowledge within and beyond the accompanying document, and we hope that C^3 can serve as a platform to study how to leverage various kinds of prior knowledge to better understand a given written or orally oriented text. C^3 is available at https://dataset.org/c3/.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.09679/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/1904.09679/full.md

## References

62 references — full list in the complete paper: https://tomesphere.com/paper/1904.09679/full.md

---
Source: https://tomesphere.com/paper/1904.09679