GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git

Tobias Lindenbauer; Egor Bogomolov; Yaroslav Zharov

arXiv:2505.22583 · cs.SE · May 29, 2025

TL;DR

GitGoodBench is a new benchmark for evaluating AI agents on Version Control System (VCS) tasks, a developer workflow that existing software engineering benchmarks such as SWE-bench overlook.

Contribution

The paper presents a novel benchmark designed specifically for assessing AI agent performance on Git workflows, covering three core Git scenarios extracted from permissive open-source Python, Java, and Kotlin repositories.

Findings

01

The baseline, GPT-4o equipped with custom tools, achieves a 21.11% overall solve rate on the prototyping version of the benchmark.

02

The comprehensive evaluation suite contains 900 samples.

03

Alongside the evaluation suite, the benchmark provides a rapid prototyping version (120 samples) and a training corpus (17,469 samples).

Abstract

Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS) operations. To address this issue, we present GitGoodBench, a novel benchmark for evaluating AI agent performance on VCS tasks. GitGoodBench covers three core Git scenarios extracted from permissive open-source Python, Java, and Kotlin repositories. Our benchmark provides three datasets: a comprehensive evaluation suite (900 samples), a rapid prototyping version (120 samples), and a training corpus (17,469 samples). We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving a 21.11% solve rate overall. We expect GitGoodBench to serve as a crucial stepping stone toward truly comprehensive SE…

Code & Models

Repositories

JetBrains-Research/git-good-bench
Official

Videos

GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git (Underline)

Taxonomy

Topics: Software Engineering Research · Scientific Computing and Data Management · Advanced Software Engineering Methodologies