TL;DR
This paper introduces a new dataset and reasoning tasks based on wikiHow to evaluate models' understanding of goal-step and temporal relations in procedural events, highlighting a significant gap between AI and human performance.
Contribution
It presents a novel dataset and benchmark for reasoning about procedural goal and step relations, enabling better evaluation of commonsense inference in AI models.
Findings
State-of-the-art transformer models trail human performance by about 10% to 20% on the human-validated test set.
Models trained on the dataset transfer effectively to out-of-domain tasks.
Zero- and few-shot performance improves substantially on SWAG, Snips, and the Story Cloze Test.
Abstract
We propose a suite of reasoning tasks on two types of relations between procedural events: goal-step relations ("learn poses" is a step in the larger goal of "doing yoga") and step-step temporal relations ("buy a yoga mat" typically precedes "learn poses"). We introduce a dataset targeting these two relations based on wikiHow, a website of instructional how-to articles. Our human-validated test set serves as a reliable benchmark for commonsense inference, with a gap of about 10% to 20% between the performance of state-of-the-art transformer models and human performance. Our automatically generated training set allows models to transfer effectively to out-of-domain tasks requiring knowledge of procedural events, with greatly improved performance on SWAG, Snips, and the Story Cloze Test in zero- and few-shot settings.
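To make the two relation types concrete, here is a minimal sketch of how goal-step and step-ordering questions could be framed as multiple-choice examples. Everything in it is an assumption for illustration: the `ARTICLES` dictionary is toy data (only the yoga steps come from the abstract), the function names are hypothetical, distractors are sampled at random, and the listed step order is taken as the temporal order. The abstract does not describe the paper's actual example construction or negative sampling, so this should not be read as the authors' method.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MultipleChoiceExample:
    prompt: str
    choices: List[str]
    label: int  # index of the correct choice


# Toy stand-in for how-to articles: goal -> ordered steps.
# Only the yoga entry comes from the abstract; the rest is made up.
ARTICLES: Dict[str, List[str]] = {
    "do yoga": ["buy a yoga mat", "learn poses"],
    "bake bread": ["mix the dough", "let the dough rise", "bake the loaf"],
    "change a tire": ["loosen the lug nuts", "jack up the car", "mount the spare"],
}


def goal_step_example(goal, articles, rng, num_choices=3):
    """Goal-step relation: which candidate is a step of the given goal?"""
    positive = rng.choice(articles[goal])
    # Assumption: distractors are random steps drawn from other goals.
    pool = [s for g, steps in articles.items() if g != goal for s in steps]
    choices = rng.sample(pool, num_choices - 1) + [positive]
    rng.shuffle(choices)
    prompt = f"Which is a step of: {goal}?"
    return MultipleChoiceExample(prompt, choices, choices.index(positive))


def step_order_example(goal, articles, rng):
    """Step-step temporal relation: which of two steps typically comes first?"""
    steps = articles[goal]
    # Assumption: the order in which steps are listed is their temporal order.
    i, j = sorted(rng.sample(range(len(steps)), 2))
    choices = [steps[i], steps[j]]
    rng.shuffle(choices)
    prompt = f"For '{goal}', which step comes first?"
    return MultipleChoiceExample(prompt, choices, choices.index(steps[i]))


if __name__ == "__main__":
    rng = random.Random(0)
    print(goal_step_example("do yoga", ARTICLES, rng))
    print(step_order_example("do yoga", ARTICLES, rng))
```

Framing both relations as multiple choice keeps scoring simple (accuracy against `label`), which is one plausible way to compare model and human performance on the same items.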