NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
Xiao Jia

TL;DR
NeuroState-Bench is a human-calibrated benchmark for assessing commitment integrity in LLM agent profiles. It shows that commitment integrity often diverges from task success and yields rankings that are more stable under distractor perturbation.
Contribution
The paper provides a human-calibrated benchmark that uses side-query probes to operationalize and evaluate commitment integrity in LLM agents across diverse tasks.
Findings
Task success and commitment integrity often diverge.
Integrity rankings are more stable under distractor perturbation.
HCCIS-CORE effectively discriminates terminal task failure.
Abstract
Outcome-only evaluation under-specifies whether an evaluated agent profile preserves the commitments required to solve a multi-turn task coherently. NeuroState-Bench is a human-calibrated benchmark that operationalizes commitment integrity through benchmark-defined side-query probes rather than inferred hidden activations. The released inventory contains 144 deterministic tasks and 306 benchmark-defined side-query probes spanning eight cognitively motivated failure families, paired clean and distractor variants, and three difficulty bands. The main 32-profile evaluation contains a fixed 16-profile local subset and a matched 16-profile hosted large-model subset evaluated through the same benchmark pipeline. Human calibration uses the final merged reporting scope: 104 sampled task units, 216 raw annotations, and 108 adjudicated task rows, with weighted kappa = 0.977 and ICC(2,1) = 0.977.…
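For readers unfamiliar with the agreement statistic quoted in the abstract, the following is a minimal sketch of Cohen's weighted kappa for two annotators on an ordinal scale. It is not the paper's calibration code: the function name `weighted_kappa`, the 0-based integer rating scale, and the default quadratic weighting are assumptions made for illustration, since the excerpt does not state the annotation scale or the weighting scheme.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, num_levels, power=2):
    """Cohen's weighted kappa for two raters on an ordinal scale 0..num_levels-1.

    power=2 gives quadratic disagreement weights, power=1 gives linear weights.
    The scale and weighting here are illustrative assumptions, not the
    benchmark's documented settings.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if num_levels < 2:
        raise ValueError("num_levels must be at least 2")

    n = len(rater_a)
    joint = Counter(zip(rater_a, rater_b))  # observed (a, b) rating pairs
    marg_a = Counter(rater_a)               # rater A marginal counts
    marg_b = Counter(rater_b)               # rater B marginal counts
    max_dist = (num_levels - 1) ** power

    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(num_levels):
        for j in range(num_levels):
            weight = abs(i - j) ** power / max_dist
            observed += weight * joint[(i, j)] / n
            expected += weight * marg_a[i] * marg_b[j] / (n * n)

    if expected == 0.0:
        raise ValueError("kappa is undefined when both raters use a single level")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical ratings, invented purely to exercise the function.
    a = [0, 1, 2, 3, 3, 2, 1, 0]
    b = [0, 1, 2, 3, 2, 2, 1, 0]
    print(round(weighted_kappa(a, b, num_levels=4), 3))
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance; the quadratic weights penalize large ordinal disagreements more heavily than adjacent ones.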
