Targeted Tests for LLM Reasoning: An Audit-Constrained Protocol

Hongmin Li

arXiv:2605.11599 · cs.LG · May 19, 2026


TL;DR

This paper introduces an audit-constrained protocol for evaluating reasoning in large language models. The protocol emphasizes semantic validity and reproducibility, and it is used to compare Component-Adaptive Prompt Sampling (CAPS), a score-based sampler over prompt components, with equal-budget uniform component sampling.

Contribution

The paper presents a methodological framework for targeted prompt-variation evaluation under strict audit and budget constraints, together with the CAPS sampling method instantiated within it.

Findings

01. The protocol effectively identifies model-error prompt keys.

02. CAPS does not outperform uniform sampling in audited yield.

03. The approach emphasizes audited, reproducible evaluation over raw mismatch counts.

Abstract

Fixed reasoning benchmarks evaluate canonical prompts, but semantically valid changes in presentation can still change model behavior. Studies of prompt variation can reveal such failures, but without audit they can mix genuine model errors with invalid perturbations, extraction artifacts, and unmatched search procedures. We propose an audit-constrained protocol for targeted reasoning evaluation. Prompt variants are generated from a finite component grammar, rendered deterministically, evaluated under a fixed query budget, and counted as model errors only after semantic and extraction audit. Within this protocol we instantiate Component-Adaptive Prompt Sampling (CAPS), a score-based sampler over prompt components, and compare it with equal-budget uniform component sampling under the same task bank, renderer, model interface, decoding settings, and audit procedure. Across three audited…
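The protocol steps named in the abstract (a finite component grammar, deterministic rendering, a fixed query budget, and audit-gated error counting) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the grammar slots, the `render`, `all_keys`, `uniform_sample`, and `audited_error_keys` names, and the record fields are hypothetical and are not taken from the paper. Only the uniform baseline is sketched; CAPS itself is not reproduced because the abstract does not specify its scoring rule.

```python
import itertools
import random

# Hypothetical finite component grammar: each slot has a fixed list of options.
GRAMMAR = {
    "format": ["Give only the final number.", "End with 'Answer: <number>'."],
    "instruction": ["Solve the problem.", "Answer the question below."],
    "order": ["instruction_first", "question_first"],
}
SLOTS = sorted(GRAMMAR)  # fixed slot order so a key always means the same thing


def all_keys():
    """Enumerate every prompt key: one option index per slot."""
    return list(itertools.product(*(range(len(GRAMMAR[s])) for s in SLOTS)))


def render(question, key):
    """Deterministically render a prompt from a question and a component key."""
    chosen = {slot: GRAMMAR[slot][i] for slot, i in zip(SLOTS, key)}
    parts = [chosen["instruction"], question]
    if chosen["order"] == "question_first":
        parts.reverse()
    parts.append(chosen["format"])
    return "\n".join(parts)


def uniform_sample(budget, seed):
    """Draw `budget` keys by sampling each component uniformly (with replacement)."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    return [
        tuple(rng.randrange(len(GRAMMAR[s])) for s in SLOTS)
        for _ in range(budget)
    ]


def audited_error_keys(records):
    """Keep only mismatches that passed both the semantic and extraction audit."""
    return {
        r["key"]
        for r in records
        if r["mismatch"] and r["semantic_ok"] and r["extraction_ok"]
    }


# Illustrative records only; a real run would fill these from model outputs
# and a human or scripted audit.
example_records = [
    {"key": (0, 0, 0), "mismatch": True, "semantic_ok": True, "extraction_ok": True},
    {"key": (0, 0, 0), "mismatch": True, "semantic_ok": True, "extraction_ok": True},
    {"key": (1, 0, 0), "mismatch": True, "semantic_ok": False, "extraction_ok": True},
    {"key": (0, 1, 0), "mismatch": False, "semantic_ok": True, "extraction_ok": True},
]
```

In this sketch, a raw mismatch count over `example_records` would be three, while the audited count of distinct prompt keys is one, which is the distinction the protocol draws between raw mismatches and audited model errors.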
