MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition

Farhad Abtahi; Abdolamir Karbalaie; Eduardo Illueca-Fernandez; Fernando Seoane

arXiv:2604.16009 · cs.AI · April 20, 2026

TL;DR

MEDLEY-BENCH is a benchmark of behavioural metacognition in AI: it measures how well models monitor and regulate their own reasoning across five domains, both in private self-revision and under social influence from disagreeing models.

Contribution

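It introduces a benchmark with two complementary scores, the Medley Metacognition Score (MMS) and the Medley Ability Score (MAS), and shows that metacognitive control, unlike evaluation, does not improve with model size, highlighting the importance of calibrated belief revision.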

Findings

01

Within model families, evaluation ability increases with model size, whereas control does not.

02

Smaller models often outperform larger ones in metacognitive tasks.

03

Evaluation ability is systematically weaker than other metacognitive sub-abilities.

Abstract

Metacognition, the ability to monitor and regulate one's own reasoning, remains under-evaluated in AI benchmarking. We introduce MEDLEY-BENCH, a benchmark of behavioural metacognition that separates independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. The benchmark evaluates 35 models from 12 families on 130 ambiguous instances across five domains and reports two complementary scores: the Medley Metacognition Score (MMS), a tier-based aggregate of reflective updating, social robustness, and epistemic articulation, and the Medley Ability Score (MAS), derived from four metacognitive sub-abilities. Results show a robust evaluation/control dissociation: evaluation ability increases with model size within families, whereas control does not. In a follow-up progressive adversarial analysis of 11 models, we observed two…
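The within-family comparison described in the abstract can be illustrated with a rank correlation between model size and a score, computed separately for each family. The following is a minimal sketch under stated assumptions, not the authors' analysis: the record keys `family`, `params_b`, and the score names passed as `score_key` are hypothetical, and Spearman correlation is a stand-in for whatever statistic the paper actually uses.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation; None when it is undefined."""
    if len(xs) < 2:
        return None
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5


def within_family_trend(records, score_key):
    """Per-family correlation between model size and one score.

    records: iterable of dicts with the hypothetical keys
    'family', 'params_b' (size in billions), and score_key.
    """
    by_family = defaultdict(list)
    for r in records:
        by_family[r["family"]].append((r["params_b"], r[score_key]))
    return {
        fam: spearman([p for p, _ in pts], [s for _, s in pts])
        for fam, pts in by_family.items()
    }
```

Under these assumptions, a dissociation of the kind the abstract reports would show up as correlations near 1 for an evaluation score in most families and no consistent sign for a control score. Families with a single model return `None` because a trend is undefined for one point.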
