The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring
Jon-Paul Cacioli

TL;DR
This paper presents a benchmark for evaluating how well large language models monitor their own answers across six cognitive domains, combining a psychometric framework with behavioural assays.
Contribution
It introduces a novel cross-domain metacognitive assessment battery grounded in established psychological paradigms, applied to 20 LLMs with publicly available data and code.
Findings
Discriminates three LLM profiles: confidence, withdrawal, sensitivity.
Reveals inverse relationship between accuracy and metacognitive sensitivity.
Shows architecture-dependent differences in metacognitive calibration scaling.
Abstract
We introduce a cross-domain behavioural assay of monitoring-control coupling in LLMs, grounded in the Nelson and Narens (1990) metacognitive framework and applying human psychometric methodology to LLM evaluation. The battery comprises 524 items across six cognitive domains (learning, metacognitive calibration, social cognition, attention, executive function, prospective regulation), each grounded in an established experimental paradigm. Tasks T1-T5 were pre-registered on OSF prior to data collection; T6 was added as an exploratory extension. After every forced-choice response, dual probes adapted from Koriat and Goldsmith (1996) ask the model to KEEP or WITHDRAW its answer and to BET or decline. The critical metric is the withdraw delta: the difference in withdrawal rate between incorrect and correct items. Applied to 20 frontier LLMs (10,480 evaluations), the battery discriminates…
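The withdraw delta described in the abstract can be illustrated with a minimal sketch. It assumes each evaluation is recorded as a pair of flags (was the answer correct, did the model choose WITHDRAW). The `Trial` record, the function names, and the toy data are hypothetical and are not taken from the paper's released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    """One forced-choice item plus the KEEP/WITHDRAW probe (hypothetical record layout)."""
    correct: bool   # whether the forced-choice answer was right
    withdrew: bool  # True if the model chose WITHDRAW, False if it chose KEEP


def withdrawal_rate(trials: List[Trial]) -> float:
    """Fraction of the given trials on which the model withdrew its answer."""
    if not trials:
        raise ValueError("withdrawal rate is undefined for an empty set of trials")
    return sum(t.withdrew for t in trials) / len(trials)


def withdraw_delta(trials: List[Trial]) -> float:
    """Withdrawal rate on incorrect items minus withdrawal rate on correct items."""
    incorrect = [t for t in trials if not t.correct]
    correct = [t for t in trials if t.correct]
    return withdrawal_rate(incorrect) - withdrawal_rate(correct)


# Toy data (invented for illustration): 4 correct items, 4 incorrect items.
trials = [
    Trial(correct=True, withdrew=False),
    Trial(correct=True, withdrew=False),
    Trial(correct=True, withdrew=True),
    Trial(correct=True, withdrew=False),
    Trial(correct=False, withdrew=True),
    Trial(correct=False, withdrew=True),
    Trial(correct=False, withdrew=False),
    Trial(correct=False, withdrew=True),
]

delta = withdraw_delta(trials)
print(f"withdraw delta = {delta:.2f}")
```

In this toy set the model withdraws on 3 of 4 incorrect items and 1 of 4 correct items, so the delta is 0.75 - 0.25 = 0.50. A positive delta means withdrawal tracks errors; a delta near zero means the model withdraws about equally often whether it is right or wrong.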
