CI-Repair-Bench: A Repository-Aware Benchmark for Automated Patch Validation via CI Workflows
Rabeya Khatun Muna, Md Nakhla Rafi, and Tse-Hsun (Peter) Chen

TL;DR
CI-Repair-Bench is a benchmark built from real GitHub Actions workflow executions. It evaluates automated program repair at the repository level across a diverse range of CI failure types.
Contribution
The paper introduces a benchmark of 567 real CI failures from 103 repositories, categorized into 12 error types. A repair counts as correct only if it passes full CI re-execution, mirroring how repositories validate changes in practice.
Findings
Automated repair is most effective for formatting and linting failures.
Environment and dependency failures remain challenging for repair methods.
The best-performing LLM achieved an 18.9% success rate.
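To make the evaluation idea concrete, the following is a minimal sketch of how per-category success rates might be tallied once each patched repository's workflow has been re-run. It is not the authors' harness: the record layout, the toy data, and the helper names `is_repaired` and `success_rates` are all assumptions for illustration, and the rule that every job must conclude with `success` is a simplification of "full CI re-execution".

```python
from collections import defaultdict

# Hypothetical toy records, NOT data from the paper:
# (error category, conclusion of each CI job after re-running
#  the workflow on the patched repository)
results = [
    ("formatting", ["success", "success"]),
    ("formatting", ["success", "failure"]),
    ("dependency", ["failure"]),
    ("linting", ["success"]),
]


def is_repaired(job_conclusions):
    """Assumed criterion: a patch counts only if every job in the re-run succeeded."""
    return bool(job_conclusions) and all(c == "success" for c in job_conclusions)


def success_rates(records):
    """Return the fraction of repaired instances per error category."""
    totals = defaultdict(int)
    fixed = defaultdict(int)
    for category, jobs in records:
        totals[category] += 1
        if is_repaired(jobs):
            fixed[category] += 1
    return {category: fixed[category] / totals[category] for category in totals}


rates = success_rates(results)
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.1%}")
```

Requiring all jobs to pass, rather than only the originally failing one, reflects the repository-level framing: a patch that fixes a lint error but breaks a test job would not count as a repair under this assumed criterion.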
Abstract
Continuous Integration (CI) enforces repository-level correctness through multi-stage workflows and is central to modern software development, yet diagnosing and repairing CI failures remains challenging. Unlike traditional program repair, CI failures frequently involve non-code artifacts, environment and dependency issues, noisy execution logs, and workflow-level constraints. Existing program repair benchmarks fall short in this setting: they are largely test-centric, restrict repairs to source code, assume fixed execution environments, and evaluate under simplified CI workflows that do not reflect real repository-level validation. We introduce CI-Repair-Bench, a benchmark for CI-verified, repository-level program repair constructed from real GitHub Actions executions. It contains 567 CI failure instances from 103 repositories and evaluates repair correctness exclusively through full…
