Targeted Tests for LLM Reasoning: An Audit-Constrained Protocol
Hongmin Li

TL;DR
This paper introduces an audit-constrained protocol for evaluating reasoning in large language models under semantically valid prompt variation, emphasizing semantic validity and reproducibility. It compares a new score-based sampler, Component-Adaptive Prompt Sampling (CAPS), with equal-budget uniform component sampling.
Contribution
It presents a new methodological framework for targeted prompt variation evaluation, including the CAPS sampling method, under strict audit and budget constraints.
Findings
The protocol identifies prompt keys that yield audited model errors.
CAPS does not outperform uniform sampling in audited yield.
The approach emphasizes audited, reproducible evaluation over raw mismatch counts.
Abstract
Fixed reasoning benchmarks evaluate canonical prompts, but semantically valid changes in presentation can still change model behavior. Studies of prompt variation can reveal such failures, but without audit they can mix genuine model errors with invalid perturbations, extraction artifacts, and unmatched search procedures. We propose an audit-constrained protocol for targeted reasoning evaluation. Prompt variants are generated from a finite component grammar, rendered deterministically, evaluated under a fixed query budget, and counted as model errors only after semantic and extraction audit. Within this protocol we instantiate Component-Adaptive Prompt Sampling (CAPS), a score-based sampler over prompt components, and compare it with equal-budget uniform component sampling under the same task bank, renderer, model interface, decoding settings, and audit procedure. Across three audited…
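The protocol steps named in the abstract (finite component grammar, deterministic rendering, fixed query budget, audit before counting) can be illustrated with a minimal sketch of the uniform-sampling baseline. Everything below is a hypothetical illustration and not the paper's implementation: the grammar slots, the `TASK` item, the function names `render`, `extract`, `uniform_keys`, and `audited_errors`, and the `model` and `audit` callables are all assumptions. CAPS itself is not sketched, since the abstract does not specify its scoring rule.

```python
import random
import re

# Hypothetical finite component grammar: each slot has a fixed option set.
GRAMMAR = {
    "instruction": ["Solve the problem.", "Answer the question below."],
    "format": ["Give only the final number.", "State the final number last."],
    "order": ["question_first", "instruction_first"],
}

# Hypothetical task-bank item.
TASK = {"question": "What is 2 + 2?", "answer": "4"}


def render(task, key):
    """Deterministically render a prompt from a task and a component key."""
    parts = dict(zip(GRAMMAR, key))  # dict order matches the key's slot order
    first, second = parts["instruction"], task["question"]
    if parts["order"] == "question_first":
        first, second = second, first
    return f"{first}\n{second}\n{parts['format']}"


def extract(output):
    """Take the last integer in the model output as its answer (assumed rule)."""
    numbers = re.findall(r"-?\d+", output)
    return numbers[-1] if numbers else None


def uniform_keys(budget, seed):
    """Draw `budget` component keys uniformly, one option per slot."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return [
        tuple(rng.choice(options) for options in GRAMMAR.values())
        for _ in range(budget)
    ]


def audited_errors(task, keys, model, audit):
    """Return keys whose mismatch survives the audit; one query per key."""
    errors = []
    for key in keys:
        prompt = render(task, key)
        output = model(prompt)
        mismatch = extract(output) != task["answer"]
        # A raw mismatch counts as a model error only if the audit accepts it.
        if mismatch and audit(task, key, prompt, output):
            errors.append(key)
    return errors


if __name__ == "__main__":
    # Stub model that fails only when the question comes first (illustrative).
    stub_model = lambda p: "Answer: 5" if p.startswith("What") else "Answer: 4"
    accept_all = lambda task, key, prompt, output: True
    keys = uniform_keys(budget=8, seed=0)
    print(audited_errors(TASK, keys, stub_model, accept_all))
```

In this sketch the query budget is simply the number of sampled keys, and the audit is a callable that can veto a mismatch, for example when the rendered variant is not semantically valid or the answer extraction failed. A score-based sampler would replace `uniform_keys` while leaving the renderer, model interface, and audit unchanged, which is the matched comparison the abstract describes.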
