MIRROR: A Hierarchical Benchmark for Metacognitive Calibration in Large Language Models

Jason Z Wang

arXiv:2604.19809·cs.AI·April 23, 2026

TL;DR

MIRROR is a comprehensive benchmark assessing large language models' ability to use self-knowledge for decision-making, revealing universal failures in compositional self-prediction and the importance of external scaffolding for safety.

Contribution

Introduces MIRROR, a hierarchical benchmark whose experiments evaluate models' metacognitive calibration, and highlights the need for external scaffolding for safer AI deployment.

Findings

01 Models cannot predict their performance on multi-domain tasks.

02 Models show above-chance but imperfect domain-specific self-knowledge.

03 External metacognitive control significantly reduces confident failure rate.

Abstract

We introduce MIRROR, a benchmark comprising eight experiments across four metacognitive levels that evaluates whether large language models can use self-knowledge to make better decisions. We evaluate 16 models from 8 labs across approximately 250,000 evaluation instances using five independent behavioral measurement channels. Core experiments are run across the full model roster; experiments with specialized infrastructure requirements report explicitly marked model subsets. We find two phenomena with direct implications for agentic deployment: (1) compositional self-prediction fails universally: the Compositional Calibration Error ranges from 0.500 to 0.943 on the original 15-model Exp3-v1 set (and 0.434 to 0.758 on the balanced 16-model Exp3-v2 expansion), indicating that models cannot predict their own performance on multi-domain tasks, and (2) models exhibit above-chance but…
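As a rough illustration of what a self-prediction calibration gap measures, the sketch below compares a model's stated probability of success with its observed accuracy on the same tasks. This is not MIRROR's definition of Compositional Calibration Error, which the abstract does not give; the function name `self_prediction_gap`, its inputs, and the absolute-gap formula are all assumptions chosen for illustration.

```python
from statistics import mean


def self_prediction_gap(predicted, correct):
    """Absolute gap between mean self-predicted success and observed accuracy.

    Hypothetical illustration only, not the paper's metric.
    predicted: probabilities in [0, 1] that the model assigned to its own success.
    correct:   booleans recording whether the model actually succeeded.
    """
    if not predicted or len(predicted) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    observed_accuracy = mean(1.0 if c else 0.0 for c in correct)
    return abs(mean(predicted) - observed_accuracy)


# A model that predicts 90% success on four tasks but solves only one of them.
gap = self_prediction_gap([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
print(f"self-prediction gap: {gap:.3f}")
```

Under this toy definition, 0 means the model's self-predictions match its accuracy on average and values near 1 mean they are almost entirely wrong, which gives one way to read why an error between 0.434 and 0.943 is described as a failure.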
