NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

Xiao Jia

arXiv:2605.01847 · cs.AI · May 15, 2026

TL;DR

NeuroState-Bench is a human-calibrated benchmark for assessing commitment integrity in LLM agent profiles. It shows that commitment integrity often diverges from task success, and that integrity rankings are more stable under distractor perturbation.

Contribution

The paper provides a human-calibrated benchmark that uses benchmark-defined side-query probes to operationalize and evaluate commitment integrity in LLM agent profiles across diverse multi-turn tasks.

Findings

01

Task success and commitment integrity often diverge.

02

Integrity rankings are more stable under distractor perturbation.

03

HCCIS-CORE effectively discriminates terminal task failure.
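
Finding 02 concerns how well a ranking of agent profiles survives the switch from clean to distractor task variants. The summary does not say which statistic the paper uses, so the following is only a sketch of one common way to quantify ranking stability: Kendall's tau-a between the two score sets. The function name, profile names, and scores are hypothetical and do not come from the paper.

```python
from itertools import combinations


def kendall_tau_a(scores_x, scores_y):
    """Kendall's tau-a between two score dicts keyed by profile name.

    Assumption: both dicts cover the same profiles. Tied pairs count as
    neither concordant nor discordant.
    """
    if set(scores_x) != set(scores_y):
        raise ValueError("both score sets must cover the same profiles")
    profiles = sorted(scores_x)
    if len(profiles) < 2:
        raise ValueError("need at least two profiles to compare rankings")

    concordant = 0
    discordant = 0
    for p, q in combinations(profiles, 2):
        # Positive product: the pair is ordered the same way in both sets.
        sign = (scores_x[p] - scores_x[q]) * (scores_y[p] - scores_y[q])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    total_pairs = len(profiles) * (len(profiles) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical integrity scores for three profiles (illustrative only).
clean = {"profile_a": 0.9, "profile_b": 0.5, "profile_c": 0.1}
distractor = {"profile_a": 0.8, "profile_b": 0.4, "profile_c": 0.2}
stability = kendall_tau_a(clean, distractor)  # order is unchanged here
```

A value near 1 means the distractor variants leave the ordering of profiles essentially intact, and a value near -1 means the ordering is reversed. Computing the same statistic for task-success scores would allow a side-by-side comparison of the two rankings.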

Abstract

Outcome-only evaluation under-specifies whether an evaluated agent profile preserves the commitments required to solve a multi-turn task coherently. NeuroState-Bench is a human-calibrated benchmark that operationalizes commitment integrity through benchmark-defined side-query probes rather than inferred hidden activations. The released inventory contains 144 deterministic tasks and 306 benchmark-defined side-query probes spanning eight cognitively motivated failure families, paired clean and distractor variants, and three difficulty bands. The main 32-profile evaluation contains a fixed 16-profile local subset and a matched 16-profile hosted large-model subset evaluated through the same benchmark pipeline. Human calibration uses the final merged reporting scope: 104 sampled task units, 216 raw annotations, and 108 adjudicated task rows, with weighted kappa = 0.977 and ICC(2,1) = 0.977.…
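
The abstract reports a weighted kappa and an ICC(2,1) for the human calibration but, in the portion shown here, not how they were computed. As a rough illustration of what these two agreement statistics measure, here is a minimal sketch, assuming two raters, ordered rating categories, and quadratic kappa weights (the weighting scheme is an assumption, not something the abstract states). Function names and example ratings are hypothetical.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories, power=2):
    """Cohen's weighted kappa for two raters over ordered categories.

    power=2 gives quadratic weights (an assumption); power=1 gives linear.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two categories")
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    obs_disagree = 0.0
    exp_disagree = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # disagreement weight
            obs_disagree += w * observed[(i, j)] / n
            exp_disagree += w * marg_a[i] * marg_b[j] / (n * n)
    if exp_disagree == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - obs_disagree / exp_disagree


def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores is a list of rows, one per rated unit, each holding k ratings.
    """
    n, k = len(scores), len(scores[0])
    if n < 2 or k < 2:
        raise ValueError("need at least two units and two raters")
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)              # between-unit mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical ratings on a 0-2 scale (illustrative only).
a = [0, 1, 2, 2]
b = [0, 1, 2, 2]
kappa = weighted_kappa(a, b, categories=[0, 1, 2])
icc = icc_2_1([[x, y] for x, y in zip(a, b)])
```

Both statistics reach 1 under perfect agreement, as in the toy ratings above. Weighted kappa penalizes larger ordinal disagreements more heavily, while ICC(2,1) also penalizes systematic offsets between raters.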
