From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
Melissa Roemmele, Andrew S. Gordon

TL;DR
This paper explores how large language models can be used to generate and evaluate commonsense reasoning questions, revealing that models proficient in answering such questions are also better at creating them.
Contribution
The paper shows that LLMs can author commonsense assessment items, and that models which perform well on an existing benchmark such as COPA are the more effective authors.
Findings
LLMs that succeed in answering the original COPA benchmark are also more successful in authoring COPA-style items.
Generated items show comparable quality to human-created questions.
Analysis includes both LLM and human evaluations.
Abstract
LLMs can now perform a variety of complex writing tasks. They also excel in answering questions pertaining to natural language inference and commonsense reasoning. Composing these questions is itself a skilled writing task, so in this paper we consider LLMs as authors of commonsense assessment items. We prompt LLMs to generate items in the style of a prominent benchmark for commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine the outcome according to analyses facilitated by the LLMs and human annotation. We find that LLMs that succeed in answering the original COPA benchmark are also more successful in authoring their own items.
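To make the authoring setup concrete, here is a minimal Python sketch of what a COPA-style item and a few-shot authoring prompt might look like. A COPA item pairs a premise with two alternatives and asks which is the more plausible cause or effect. The class name `CopaItem`, the helper functions, the prompt wording, and the example item are all hypothetical and written for this illustration; they are not taken from the paper or from the COPA dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CopaItem:
    """One COPA-style item (field names are assumptions for this sketch)."""
    premise: str
    alternative_1: str
    alternative_2: str
    asks_for: str  # "cause" or "effect"
    correct: int   # 1 or 2

    def __post_init__(self) -> None:
        # Reject items that do not fit the two-alternative cause/effect format.
        if self.asks_for not in ("cause", "effect"):
            raise ValueError("asks_for must be 'cause' or 'effect'")
        if self.correct not in (1, 2):
            raise ValueError("correct must be 1 or 2")


def format_item(item: CopaItem) -> str:
    """Render an item as plain text for use as a few-shot example."""
    return "\n".join([
        f"Premise: {item.premise}",
        f"Question: What is the most plausible {item.asks_for}?",
        f"Alternative 1: {item.alternative_1}",
        f"Alternative 2: {item.alternative_2}",
        f"Correct alternative: {item.correct}",
    ])


def build_authoring_prompt(examples: list[CopaItem], asks_for: str) -> str:
    """Assemble a few-shot prompt asking a model to write one new item.

    The instruction wording is an assumption, not the paper's prompt.
    """
    if asks_for not in ("cause", "effect"):
        raise ValueError("asks_for must be 'cause' or 'effect'")
    header = "Write a new commonsense reasoning item in the same format as the examples."
    shots = "\n\n".join(format_item(e) for e in examples)
    request = f"New item (the question should ask for the {asks_for}):"
    return "\n\n".join([header, shots, request])


# Example item invented for this sketch.
example = CopaItem(
    premise="The grass was wet in the morning.",
    alternative_1="It rained overnight.",
    alternative_2="The sun came out.",
    asks_for="cause",
    correct=1,
)
print(build_authoring_prompt([example], asks_for="effect"))
```

The resulting string would be sent to whichever LLM is being evaluated as an author; the model call itself is omitted because it depends on the provider.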
Taxonomy
Topics
Artificial Intelligence in Law
