MIRROR: A Hierarchical Benchmark for Metacognitive Calibration in Large Language Models
Jason Z Wang

TL;DR
MIRROR is a benchmark that tests whether large language models can use self-knowledge to make better decisions. It reveals that compositional self-prediction fails across all evaluated models and that external scaffolding is important for safety.
Contribution
Introduces MIRROR, a hierarchical benchmark of eight experiments across four metacognitive levels that evaluates models' metacognitive calibration, and highlights the necessity of external scaffolding for safer AI deployment.
Findings
Models cannot predict their own performance on multi-domain tasks (Compositional Calibration Error of 0.500 to 0.943 on Exp3-v1 and 0.434 to 0.758 on Exp3-v2).
Models show above-chance but imperfect domain-specific self-knowledge.
External metacognitive control significantly reduces the confident failure rate.
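To make the quantities in these findings concrete, here is a minimal Python sketch of two generic calibration measures. It is an illustration only: this summary does not give the paper's formulas, so the definitions below (a gap between mean stated confidence and observed accuracy, and a confident failure rate with a hypothetical 0.8 threshold) are assumptions and may differ from MIRROR's Compositional Calibration Error and its own confident failure metric. The `Trial` type and the sample data are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One evaluation item (hypothetical structure, not from the paper)."""
    confidence: float  # model's stated probability of being correct, in [0, 1]
    correct: bool      # whether the model's answer was actually correct


def self_prediction_gap(trials: list[Trial]) -> float:
    """Absolute gap between mean stated confidence and observed accuracy.

    Assumed stand-in for a calibration error; 0.0 means the model's
    average self-prediction matches its actual accuracy.
    """
    if not trials:
        raise ValueError("need at least one trial")
    predicted = sum(t.confidence for t in trials) / len(trials)
    observed = sum(t.correct for t in trials) / len(trials)
    return abs(predicted - observed)


def confident_failure_rate(trials: list[Trial], threshold: float = 0.8) -> float:
    """Fraction of all trials answered wrongly with confidence >= threshold.

    The threshold value is an assumption chosen for illustration.
    """
    if not trials:
        raise ValueError("need at least one trial")
    failures = sum(1 for t in trials if t.confidence >= threshold and not t.correct)
    return failures / len(trials)


# Made-up data: four items, one correct, two answered with high confidence.
trials = [
    Trial(confidence=0.9, correct=True),
    Trial(confidence=0.9, correct=False),
    Trial(confidence=0.6, correct=False),
    Trial(confidence=0.3, correct=False),
]
gap = self_prediction_gap(trials)
cfr = confident_failure_rate(trials)
print(f"self-prediction gap: {gap:.3f}")
print(f"confident failure rate: {cfr:.3f}")
```

Under these assumed definitions, an external controller that withholds or escalates high-confidence answers would lower the second number without changing the model itself, which is the kind of effect the third finding describes.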
Abstract
We introduce MIRROR, a benchmark comprising eight experiments across four metacognitive levels that evaluates whether large language models can use self-knowledge to make better decisions. We evaluate 16 models from 8 labs across approximately 250,000 evaluation instances using five independent behavioral measurement channels. Core experiments are run across the full model roster; experiments with specialized infrastructure requirements report explicitly marked model subsets. We find two phenomena with direct implications for agentic deployment: (1) compositional self-prediction fails universally -- the Compositional Calibration Error ranges from 0.500 to 0.943 on the original 15-model Exp3-v1 set (and 0.434 to 0.758 on the balanced 16-model Exp3-v2 expansion), indicating that models cannot predict their own performance on multi-domain tasks, and (2) models exhibit above-chance but…
