Benchmarks for Automated Commonsense Reasoning: A Survey

Ernest Davis

arXiv:2302.04752·cs.AI·February 24, 2023·6 cites

TL;DR

This survey reviews 139 AI commonsense benchmarks, analyzing how they were developed, where they are flawed, and which aspects of common sense they leave untested. It offers recommendations for building more reliable and comprehensive benchmarks that better measure AI systems' commonsense reasoning abilities.

Contribution

The paper provides a comprehensive overview of existing commonsense benchmarks, identifies common flaws, and proposes guidelines for improving benchmark quality and coverage.

Findings

01. Many benchmarks are flawed and inconsistent in quality.

02. Existing benchmarks do not cover all aspects of commonsense reasoning.

03. Recommendations for future benchmark development are provided.

Abstract

More than one hundred benchmarks have been developed to test the commonsense knowledge and commonsense reasoning abilities of artificial intelligence (AI) systems. However, these benchmarks are often flawed and many aspects of common sense remain untested. Consequently, we do not currently have any reliable way of measuring to what extent existing AI systems have achieved these abilities. This paper surveys the development and uses of AI commonsense benchmarks. We discuss the nature of common sense; the role of common sense in AI; the goals served by constructing commonsense benchmarks; and desirable features of commonsense benchmarks. We analyze the common flaws in benchmarks, and we argue that it is worthwhile to invest the work needed to ensure that benchmark examples are consistently high quality. We survey the various methods of constructing commonsense benchmarks. We enumerate 139…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI

Methods: Test