GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git
Tobias Lindenbauer, Egor Bogomolov, Yaroslav Zharov

TL;DR
GitGoodBench is a new benchmark for evaluating AI agents on version control tasks. It fills a gap left by existing software engineering benchmarks, which overlook Version Control System (VCS) operations.
Contribution
It presents a novel benchmark designed specifically for assessing AI agent performance on Git version control workflows in software engineering.
Findings
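A GPT-4o baseline equipped with custom tools achieves a 21.11% overall solve rate on the prototyping version of the benchmark.
The comprehensive evaluation suite contains 900 samples.
Two further datasets are provided: a rapid prototyping version (120 samples) and a training corpus (17,469 samples).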
Abstract
Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS) operations. To address this issue, we present GitGoodBench, a novel benchmark for evaluating AI agent performance on VCS tasks. GitGoodBench covers three core Git scenarios extracted from permissive open-source Python, Java, and Kotlin repositories. Our benchmark provides three datasets: a comprehensive evaluation suite (900 samples), a rapid prototyping version (120 samples), and a training corpus (17,469 samples). We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving a 21.11% solve rate overall. We expect GitGoodBench to serve as a crucial stepping stone toward truly comprehensive SE…
Taxonomy
Topics: Software Engineering Research · Scientific Computing and Data Management · Advanced Software Engineering Methodologies
