From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items

Melissa Roemmele; Andrew S. Gordon

arXiv:2410.14897·cs.CL·October 22, 2024

TL;DR

This paper explores how large language models can be used to generate and evaluate commonsense reasoning questions, revealing that models proficient in answering such questions are also better at creating them.

Contribution

The paper shows that LLMs can serve as authors of commonsense assessment items, and that the models that perform well on existing benchmarks such as COPA are the ones that author such items more successfully.

Findings

1. LLMs successful on COPA are better at generating similar questions.

2. Generated items show comparable quality to human-created questions.

3. Analysis includes both LLM and human evaluations.

Abstract

LLMs can now perform a variety of complex writing tasks. They also excel in answering questions pertaining to natural language inference and commonsense reasoning. Composing these questions is itself a skilled writing task, so in this paper we consider LLMs as authors of commonsense assessment items. We prompt LLMs to generate items in the style of a prominent benchmark for commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine the outcome according to analyses facilitated by the LLMs and human annotation. We find that LLMs that succeed in answering the original COPA benchmark are also more successful in authoring their own items.
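To make the authoring task concrete, here is a minimal Python sketch of a COPA-style item and a few-shot prompt that asks a model to write a new one. A COPA item pairs a premise with two alternatives and asks which is the more plausible cause or effect. Everything else here is an assumption made for illustration: the names `CopaItem`, `format_item`, and `build_authoring_prompt`, the text layout, and the prompt wording are hypothetical and are not the prompts used in the paper. The example item is invented for this sketch and is not drawn from the benchmark.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CopaItem:
    """One COPA-style item: a premise plus two alternatives, one more plausible."""
    premise: str
    question: str        # "cause" or "effect"
    alternative_1: str
    alternative_2: str
    more_plausible: int  # 1 or 2

    def __post_init__(self) -> None:
        # Validate only; no fields are reassigned, so frozen=True is respected.
        if self.question not in ("cause", "effect"):
            raise ValueError("question must be 'cause' or 'effect'")
        if self.more_plausible not in (1, 2):
            raise ValueError("more_plausible must be 1 or 2")


def format_item(item: CopaItem) -> str:
    """Render an item as plain text (hypothetical layout, not the paper's)."""
    return "\n".join([
        f"Premise: {item.premise}",
        f"Asks for: {item.question}",
        f"Alternative 1: {item.alternative_1}",
        f"Alternative 2: {item.alternative_2}",
        f"More plausible: {item.more_plausible}",
    ])


def build_authoring_prompt(examples: List[CopaItem], question: str) -> str:
    """Assemble a few-shot prompt asking a model to write one new item."""
    if question not in ("cause", "effect"):
        raise ValueError("question must be 'cause' or 'effect'")
    header = "Write a new commonsense reasoning item in the same format as the examples."
    shots = "\n\n".join(format_item(example) for example in examples)
    request = f"New item (asks for: {question}):"
    return "\n\n".join([header, shots, request])


# Invented example item, written for this sketch only.
EXAMPLE = CopaItem(
    premise="The ground was wet this morning.",
    question="cause",
    alternative_1="It rained overnight.",
    alternative_2="The streetlights were on overnight.",
    more_plausible=1,
)

if __name__ == "__main__":
    print(build_authoring_prompt([EXAMPLE], "effect"))
```

The returned string would be sent to whichever model is being studied; parsing the model's reply back into a `CopaItem` and judging its quality are separate steps that this sketch leaves out.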

Taxonomy

Topics: Artificial Intelligence in Law