Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas

Jon-Paul Cacioli

arXiv:2605.06673·cs.CL·May 11, 2026

TL;DR

This study measures domain-specific metacognitive monitoring in 33 frontier LLMs on 1,500 MMLU items, using Type-2 AUROC computed from verbalized confidence. It finds non-trivial within-model variation across domains and argues for domain-level screening before deployment.

Contribution

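The paper provides the first comprehensive analysis of domain-level metacognitive monitoring in frontier LLMs. It covers 33 models from eight families and identifies which domain rankings are stable across models (the easiest and hardest domains) and which are not (the three middle domains).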

Findings

01

Every model with above-chance aggregate monitoring shows non-trivial variation in monitoring quality across domains.

02

Applied/Professional knowledge is the easiest domain to monitor (mean AUROC = .742; ranked top-2 in 21 of 33 models).

03

Formal Reasoning and Natural Science are the hardest domains to monitor (one of the two ranked bottom-2 in 27 of 33 models).

Abstract

Aggregate metacognitive quality scores mask within-model variation across MMLU benchmark domains. We administered 1,500 MMLU items (250 per domain, under an a priori six-domain grouping) to 33 frontier LLMs from eight model families and computed Type-2 AUROC per model-domain cell using verbalized confidence (0-100). Total observations: 47,151. Every model with above-chance aggregate monitoring showed non-trivial domain-level variation. Applied/Professional knowledge was reliably the easiest benchmark domain to monitor (mean AUROC = .742, ranked top-2 in 21 of 33 models); Formal Reasoning and Natural Science were reliably the hardest (one of the two ranked bottom-2 in 27 of 33 models). The three middle domains were statistically indistinguishable (Kendall's W = .164). A subject-level coherence analysis (within-domain similarity ratio = 0.95) confirms the six-domain grouping is a…
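To make the per-cell metric concrete, here is a minimal sketch of how a Type-2 AUROC could be computed for each domain from verbalized confidence and answer correctness. It treats the AUROC as the probability that a correct answer received higher confidence than an incorrect one, with ties counted as one half. The function names (`type2_auroc`, `auroc_by_domain`) and the toy records are hypothetical and used only for illustration; the paper's exact estimator and tie handling are not shown in this excerpt.

```python
from collections import defaultdict


def type2_auroc(confidences, correct):
    """Probability that a randomly chosen correct answer got higher
    confidence than a randomly chosen incorrect one (ties count 0.5)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return None  # undefined when every answer is right or every answer is wrong
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auroc_by_domain(records):
    """records: iterable of (domain, confidence_0_to_100, is_correct) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for domain, conf, ok in records:
        grouped[domain][0].append(conf)
        grouped[domain][1].append(ok)
    return {d: type2_auroc(c, k) for d, (c, k) in grouped.items()}


# Toy data for one hypothetical model (invented, not from the paper).
records = [
    ("Formal Reasoning", 90, True),
    ("Formal Reasoning", 80, False),
    ("Formal Reasoning", 60, True),
    ("Formal Reasoning", 40, False),
    ("Applied/Professional", 95, True),
    ("Applied/Professional", 85, True),
    ("Applied/Professional", 30, False),
]
per_domain = auroc_by_domain(records)
```

A value of 0.5 corresponds to chance-level monitoring and 1.0 to confidence that perfectly separates correct from incorrect answers. The pairwise loop is quadratic in the number of items, which is acceptable at 250 items per cell; a rank-based formulation would give the same value more efficiently.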
