Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas
Jon-Paul Cacioli

TL;DR
This study evaluates domain-level metacognitive monitoring in 33 frontier LLMs on 1,500 MMLU items grouped into six domains. Aggregate scores mask non-trivial within-model variation across domains, which makes domain-level screening important before deployment.
Contribution
It provides the first comprehensive analysis of domain-level metacognitive monitoring in frontier LLMs, highlighting variability and stability across models and domains.
Findings
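Every model with above-chance aggregate monitoring shows non-trivial variation in monitoring quality across domains.
Applied/Professional knowledge is reliably the easiest domain to monitor (mean AUROC = .742, ranked top-2 in 21 of 33 models).
Formal Reasoning and Natural Science are reliably the hardest to monitor (one of the two ranked bottom-2 in 27 of 33 models).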
Abstract
Aggregate metacognitive quality scores mask within-model variation across MMLU benchmark domains. We administered 1,500 MMLU items (250 per domain, under an a priori six-domain grouping) to 33 frontier LLMs from eight model families and computed Type-2 AUROC per model-domain cell using verbalized confidence (0-100). Total observations: 47,151. Every model with above-chance aggregate monitoring showed non-trivial domain-level variation. Applied/Professional knowledge was reliably the easiest benchmark domain to monitor (mean AUROC = .742, ranked top-2 in 21 of 33 models); Formal Reasoning and Natural Science were reliably the hardest (one of the two ranked bottom-2 in 27 of 33 models). The three middle domains were statistically indistinguishable (Kendall's W = .164). A subject-level coherence analysis (within-domain similarity ratio = 0.95) confirms the six-domain grouping is a…
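To make the metric concrete, here is a minimal sketch of a Type-2 AUROC computed per model-domain cell. It is an illustration under stated assumptions, not the paper's pipeline. The function names `type2_auroc` and `auroc_by_cell`, the record layout, and the toy records are all hypothetical. Type-2 AUROC is taken here in its usual sense: the probability that a randomly chosen correct answer received higher verbalized confidence than a randomly chosen incorrect one, with ties counted as one half.

```python
from collections import defaultdict


def type2_auroc(confidences, correct):
    """Type-2 AUROC: P(confidence on a correct item > confidence on an
    incorrect item), with ties counted as 0.5. Returns None when the cell
    has no correct or no incorrect items, since the metric is undefined."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auroc_by_cell(records):
    """Group (model, domain, confidence, is_correct) records into
    model-domain cells and score each cell separately."""
    cells = defaultdict(lambda: ([], []))
    for model, domain, conf, ok in records:
        confs, oks = cells[(model, domain)]
        confs.append(conf)
        oks.append(ok)
    return {cell: type2_auroc(c, o) for cell, (c, o) in cells.items()}


# Toy records (made up for illustration; confidence is on the 0-100 scale).
records = [
    ("model_a", "Formal Reasoning", 90, True),
    ("model_a", "Formal Reasoning", 60, False),
    ("model_a", "Formal Reasoning", 60, True),
    ("model_a", "Formal Reasoning", 80, False),
]
scores = auroc_by_cell(records)
print(scores[("model_a", "Formal Reasoning")])
```

A value of 0.5 corresponds to chance-level monitoring and 1.0 to confidence that perfectly separates correct from incorrect answers. The pairwise loop is quadratic in cell size, which is acceptable for cells of about 250 items. A rank-based implementation would be the natural choice for larger cells.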
