MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition
Farhad Abtahi, Abdolamir Karbalaie, Eduardo Illueca-Fernandez, Fernando Seoane

TL;DR
MEDLEY-BENCH is a benchmark for AI metacognition. It measures how well models monitor and regulate their own reasoning across five domains, both in private and under social influence from disagreeing models.
Contribution
The paper introduces a benchmark that reports two complementary scores, the Medley Metacognition Score (MMS) and the Medley Ability Score (MAS). It shows that control ability does not improve with model scale and highlights the importance of calibrated belief revision.
Findings
Within model families, evaluation ability increases with model size, but control does not.
Smaller models often outperform larger ones in metacognitive tasks.
Evaluation ability is systematically weaker than other metacognitive sub-abilities.
Abstract
Metacognition, the ability to monitor and regulate one's own reasoning, remains under-evaluated in AI benchmarking. We introduce MEDLEY-BENCH, a benchmark of behavioural metacognition that separates independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. The benchmark evaluates 35 models from 12 families on 130 ambiguous instances across five domains and reports two complementary scores: the Medley Metacognition Score (MMS), a tier-based aggregate of reflective updating, social robustness, and epistemic articulation, and the Medley Ability Score (MAS), derived from four metacognitive sub-abilities. Results show a robust evaluation/control dissociation: evaluation ability increases with model size within families, whereas control does not. In a follow-up progressive adversarial analysis of 11 models, we observed two…
