AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair

Yuelin Hu; Zhenbo Yu; Zhengxue Cheng; Wei Liu; Li Song

arXiv:2605.04624·cs.AI·May 7, 2026

TL;DR

AuditRepairBench provides a large, modular dataset and evaluation framework to analyze and mitigate ranking instability caused by evaluator reconfiguration in agent repair leaderboards.

Contribution

It introduces a comprehensive paired-execution trace corpus and a modular screening architecture to study and reduce ranking instability in agent repair evaluations.

Findings

01

Pooled AUROC of 0.83 on a source-level channel-surgery subset.

02

Screening-guided blinding patches reduce rank displacement by 55-74%.

03

AuditRepairBench-Lite preserves leaderboard Kendall tau at 0.88 with minimal computational resources.
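
Findings 02 and 03 are stated in terms of rank displacement and Kendall tau between leaderboards. The snippet below is a minimal sketch of how two such quantities can be computed for a pair of rankings. The system names, the toy ranks, and the use of mean absolute rank change as the displacement measure are assumptions made for illustration; the paper's exact definitions may differ.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two rankings (dicts: system -> rank, no ties)."""
    systems = sorted(rank_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # A pair is concordant when both rankings order s and t the same way.
        sign = (rank_a[s] - rank_a[t]) * (rank_b[s] - rank_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


def mean_rank_displacement(rank_a, rank_b):
    """Average absolute change in rank per system (assumed definition)."""
    return sum(abs(rank_a[s] - rank_b[s]) for s in rank_a) / len(rank_a)


# Hypothetical leaderboards before and after an evaluator reconfiguration.
baseline = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
reconfigured = {"A": 2, "B": 1, "C": 3, "D": 5, "E": 4}

tau = kendall_tau(baseline, reconfigured)
displacement = mean_rank_displacement(baseline, reconfigured)
print(f"Kendall tau: {tau:.2f}, mean rank displacement: {displacement:.2f}")
```

In this toy case two of the ten system pairs swap order, so tau is 0.6 and the mean displacement is 0.8 ranks. A tau of 0.88, as reported for AuditRepairBench-Lite, would correspond to far fewer swapped pairs relative to the full leaderboard.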

Abstract

Agent-repair leaderboards reorder under evaluator reconfiguration, and a measurable share of the reordering is produced by methods that consult evaluator-derived signal during internal selection of candidate repairs. We document this failure mode on a public leaderboard and release AuditRepairBench, a paired-execution trace corpus of 576,000 registered cells (96,000 executed) that operationalizes evaluator-channel-blocking ranking instability within a declared observability boundary. A modular screening architecture decides pathway-blocking through four interchangeable implementations: a learned influence proxy, a rule-based channel-exposure ratio that uses no trained model, a counterfactual sensitivity proxy, and a sparse human-audit proxy. These are combined into a screening posterior that feeds a cell-level flip functional, a set-valued label, a stratified system score, and a set-valued…
