GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging

Ziyi Ni; Huacan Wang; Shuo Zhang; Shuo Lu; Ziyang He; Wang You; Zhenheng Tang; Yuntao Du; Bill Sun; Hongzhang Liu; Sen Hu; Ronghao Chen; Bo Li; Xin Li; Chen Hu; Binxing Jiao; Daxin Jiang; Pin Lyu

arXiv:2508.18993 · cs.SE · September 16, 2025

TL;DR

GitTaskBench is a new benchmark that evaluates code agents on real-world tasks using large code repositories, highlighting current challenges and guiding future improvements in practical code reasoning and execution.

Contribution

Introduces GitTaskBench, a benchmark of 54 realistic tasks spanning 7 modalities and 7 domains, and proposes the alpha-value metric, which combines task success rate, token cost, and average developer salary to estimate the economic benefit of code agents.

Findings

01. State-of-the-art agents solve less than 50% of tasks

02. Environment setup and dependency resolution are major failure points

03. Progress has been made, but significant challenges remain

Abstract

Beyond scratch coding, exploiting large-scale code repositories (e.g., GitHub) for practical tasks is vital in real-world software development, yet current benchmarks rarely evaluate code agents in such authentic, workflow-driven scenarios. To bridge this gap, we introduce GitTaskBench, a benchmark designed to systematically assess this capability via 54 realistic tasks across 7 modalities and 7 domains. Each task pairs a relevant repository with an automated, human-curated evaluation harness specifying practical success criteria. Beyond measuring execution and task success, we also propose the alpha-value metric to quantify the economic benefit of agent performance, which integrates task success rates, token cost, and average developer salaries. Experiments across three state-of-the-art agent frameworks with multiple advanced LLMs show that leveraging code repositories for complex task…
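The abstract names the inputs to the alpha-value (task success, token cost, developer salaries) but this excerpt does not give the formula. The following is a minimal sketch under one assumed form: each task is credited with the cost of a human developer doing it when the agent succeeds, the agent's token spend is always subtracted, and the result is averaged over tasks. The names `TaskRun`, `human_cost_usd`, and `alpha_value` are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TaskRun:
    """One agent attempt at one benchmark task (hypothetical record layout)."""
    succeeded: bool          # did the run pass the task's evaluation harness?
    token_cost_usd: float    # what the LLM calls for this run cost
    human_cost_usd: float    # assumed: developer hourly rate * estimated hours


def alpha_value(runs: Sequence[TaskRun]) -> float:
    """Average net economic benefit per task, under the assumed formula.

    A successful run earns the human cost it replaces; every run,
    successful or not, pays its own token cost.
    """
    if not runs:
        raise ValueError("alpha_value needs at least one run")
    net = [
        (r.human_cost_usd if r.succeeded else 0.0) - r.token_cost_usd
        for r in runs
    ]
    return sum(net) / len(net)


if __name__ == "__main__":
    runs = [
        TaskRun(succeeded=True, token_cost_usd=2.0, human_cost_usd=50.0),
        TaskRun(succeeded=False, token_cost_usd=3.0, human_cost_usd=80.0),
    ]
    print(alpha_value(runs))
```

Under this assumed form, a failed run contributes a negative value equal to its token cost, so an agent that spends heavily on tasks it cannot finish can score below zero even with a nonzero success rate.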

Code & Models

Datasets

Nicole-Yi/GitTaskBench
dataset · 442 dl

Videos

GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging· underline