R4C: A Benchmark for Evaluating RC Systems to Get the Right Answer for the Right Reason
Naoya Inoue, Pontus Stenetorp, Kentaro Inui

TL;DR
R4C is a benchmark dataset for reading comprehension (RC) that evaluates systems' reasoning by requiring derivations alongside answers, addressing biases in existing datasets and enabling more reliable measurement of progress.
Contribution
The paper presents R4C, a dataset of RC questions annotated with derivations (explanations that justify predicted answers), and a scalable crowdsourcing framework for annotating RC datasets with such derivations.
Findings
Automatic evaluation metrics that use multiple reference derivations are reliable.
R4C assesses different skills from an existing benchmark.
The dataset contains 4.6k questions, each annotated with 3 reference derivations (13.8k derivations in total).
Abstract
Recent studies have revealed that reading comprehension (RC) systems learn to exploit annotation artifacts and other biases in current datasets. This prevents the community from reliably measuring the progress of RC systems. To address this issue, we introduce R4C, a new task for evaluating RC systems' internal reasoning. R4C requires giving not only answers but also derivations: explanations that justify predicted answers. We present a reliable, crowdsourced framework for scalably annotating RC datasets with derivations. We create and publicly release the R4C dataset, the first, quality-assured dataset consisting of 4.6k questions, each of which is annotated with 3 reference derivations (i.e. 13.8k derivations). Experiments show that our automatic evaluation metrics using multiple reference derivations are reliable, and that R4C assesses different skills from an existing benchmark.
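To make the idea of scoring against multiple reference derivations concrete, here is a minimal sketch. It is not the paper's metric: the abstract does not specify the derivation format or the scoring function, so the sketch assumes a derivation is a set of (head, relation, tail) triples and uses exact-match F1 over steps, keeping the best score across references. The names `Step`, `Derivation`, `step_f1`, and `multi_reference_score` are hypothetical, and the example facts are invented for illustration.

```python
from typing import FrozenSet, Iterable, Tuple

# Assumption: one derivation step is a (head, relation, tail) triple.
Step = Tuple[str, str, str]
Derivation = FrozenSet[Step]


def step_f1(predicted: Derivation, reference: Derivation) -> float:
    """Exact-match F1 between two sets of steps (a deliberate simplification)."""
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(
    predicted: Derivation, references: Iterable[Derivation]
) -> float:
    """Score the prediction against every reference and keep the best match."""
    return max((step_f1(predicted, ref) for ref in references), default=0.0)


# Invented example: three reference derivations for one question,
# mirroring the 3-references-per-question setup described in the abstract.
references = [
    frozenset({("A", "is in", "B"), ("B", "is in", "C")}),
    frozenset({("A", "is located in", "B"), ("B", "is in", "C")}),
    frozenset({("A", "is in", "C")}),
]
predicted = frozenset({("B", "is in", "C")})
score = multi_reference_score(predicted, references)
```

Taking the maximum over references means a prediction is not penalized for following one valid line of reasoning rather than another, which is the motivation for collecting several reference derivations per question.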
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
