AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair
Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, Li Song

TL;DR
AuditRepairBench provides a large, modular dataset and evaluation framework to analyze and mitigate ranking instability caused by evaluator reconfiguration in agent repair leaderboards.
Contribution
It introduces a comprehensive paired-execution trace corpus and a modular screening architecture to study and reduce ranking instability in agent repair evaluations.
Findings
Pooled AUROC of 0.83 on a source-level channel-surgery subset.
Screening-guided blinding patches reduce rank displacement by 55-74%.
AuditRepairBench-Lite preserves leaderboard Kendall tau at 0.88 with minimal computational resources.
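For readers unfamiliar with the two ranking metrics named in these findings, here is a minimal sketch of how the agreement between two leaderboards could be computed. It is not the authors' code. The system names and ranks are made up, and defining rank displacement as the mean absolute change in rank is an assumption, since the summary does not give the paper's exact definition.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two rankings (dicts: system -> rank, no ties)."""
    systems = sorted(rank_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # A pair is concordant when both rankings order it the same way.
        sign = (rank_a[s] - rank_a[t]) * (rank_b[s] - rank_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


def mean_rank_displacement(rank_a, rank_b):
    """Mean absolute rank change per system (assumed definition)."""
    return sum(abs(rank_a[s] - rank_b[s]) for s in rank_a) / len(rank_a)


# Hypothetical leaderboards before and after an evaluator reconfiguration.
baseline = {"sys_a": 1, "sys_b": 2, "sys_c": 3, "sys_d": 4}
reconfigured = {"sys_a": 2, "sys_b": 1, "sys_c": 3, "sys_d": 4}

tau = kendall_tau(baseline, reconfigured)
disp = mean_rank_displacement(baseline, reconfigured)
print(f"Kendall tau: {tau:.3f}, mean rank displacement: {disp:.2f}")
```

A tau of 1.0 means the two leaderboards order every pair of systems identically, so a reported tau of 0.88 indicates that most, but not all, pairwise orderings are preserved.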
Abstract
Agent-repair leaderboards reorder under evaluator reconfiguration, and a measurable share of the reordering is produced by methods that consult evaluator-derived signal during internal selection of candidate repairs. We document this failure mode on a public leaderboard and release AuditRepairBench, a paired-execution trace corpus of 576,000 registered cells (96,000 executed) that operationalizes evaluator-channel-blocking ranking instability within a declared observability boundary. A modular screening architecture decides pathway-blocking through four interchangeable implementations: a learned influence proxy, a rule-based channel-exposure ratio that uses no trained model, a counterfactual sensitivity proxy, and a sparse human-audit proxy. These are combined into a screening posterior that feeds a cell-level flip functional, a set-valued label, a stratified system score, and a set-valued…
