Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema
Yanai Elazar, Hongming Zhang, Yoav Goldberg, Dan Roth

TL;DR
This paper critically examines the Winograd Schema benchmarks, revealing that recent performance gains are largely due to artifacts and supervision rather than true commonsense reasoning, and proposes improved evaluation methods.
Contribution
The paper introduces a more robust evaluation framework for the Winograd Schema (WS), identifies artifacts in existing benchmarks, and demonstrates that current models lack genuine commonsense reasoning in zero-shot settings.
Findings
Current WS evaluation is sub-optimal
Popular language models perform at chance level in a strict zero-shot setting
Recent progress stems mainly from benchmark artifacts and supervised training
Abstract
The Winograd Schema (WS) has been proposed as a test for measuring commonsense capabilities of models. Recently, pre-trained language model-based approaches have boosted performance on some WS benchmarks but the source of improvement is still not clear. This paper suggests that the apparent progress on WS may not necessarily reflect progress in commonsense reasoning. To support this claim, we first show that the current evaluation method of WS is sub-optimal and propose a modification that uses twin sentences for evaluation. We also propose two new baselines that indicate the existence of artifacts in WS benchmarks. We then develop a method for evaluating WS-like sentences in a zero-shot setting to account for the commonsense reasoning abilities acquired during the pretraining and observe that popular language models perform randomly in this setting when using our more strict…
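The twin-sentence evaluation mentioned in the abstract can be illustrated with a small sketch. It assumes that a pair of twin sentences earns credit only when both sentences are resolved correctly; the excerpt above does not spell out the exact scoring rule, so this is an illustrative reading rather than the authors' implementation. The field names (`pair_id`, `gold`, `pred`), the function names, and the toy data are all hypothetical.

```python
from collections import defaultdict


def single_accuracy(examples):
    """Fraction of individual sentences answered correctly."""
    return sum(ex["pred"] == ex["gold"] for ex in examples) / len(examples)


def paired_accuracy(examples):
    """Fraction of twin pairs in which every sentence is answered correctly.

    Assumption: a pair only counts if all of its sentences are correct.
    """
    pairs = defaultdict(list)
    for ex in examples:
        pairs[ex["pair_id"]].append(ex["pred"] == ex["gold"])
    return sum(all(hits) for hits in pairs.values()) / len(pairs)


# Toy data invented for illustration: twins share a pair_id and have
# opposite gold answers; the "model" always picks candidate "A".
examples = [
    {"pair_id": 0, "gold": "A", "pred": "A"},
    {"pair_id": 0, "gold": "B", "pred": "A"},
    {"pair_id": 1, "gold": "A", "pred": "A"},
    {"pair_id": 1, "gold": "B", "pred": "A"},
]

print(single_accuracy(examples))  # per-sentence score
print(paired_accuracy(examples))  # per-pair score
```

On this toy data, a model that ignores the sentence and always chooses the same candidate gets half of the individual sentences right but no pair fully right, which is the kind of gap a pair-level score is meant to expose.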
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
