TL;DR
SWE-CI introduces a new benchmark based on continuous integration to evaluate how well language model agents maintain code quality over long-term software development cycles.
Contribution
It presents the first repository-level benchmark that assesses agent capabilities in sustaining code maintainability through long-term evolution.
Findings
The benchmark includes 100 tasks, each derived from a real-world code repository with a development history averaging 233 days.
Agents are evaluated over multiple iterations of analysis and coding rather than a single one-shot fix.
The results offer insight into how well LLM-powered agents maintain code quality over the long term.
Abstract
Large language model (LLM)-powered agents have demonstrated strong capabilities in automating software engineering tasks such as static bug fixing. However, in the real world, the development of mature software is typically predicated on complex requirement changes and long-term feature iterations -- a process that static, one-shot repair paradigms fail to capture. To bridge this gap, we propose SWE-CI, the first repository-level benchmark built upon the Continuous Integration loop, aiming to shift the evaluation paradigm for code generation from static, short-term functional correctness toward dynamic, long-term maintainability. The key insight is simple: Maintainability can be revealed by tracking how functional correctness changes over time. The benchmark comprises 100 tasks, each deriving from a real-world code repository with a development history spanning an average of 233 days…
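The key insight above (tracking how functional correctness changes over time) can be illustrated with a small sketch. This is not the paper's metric or harness: the `IterationResult` class, the helper functions, and the sample numbers are all hypothetical, chosen only to show what "correctness over time" could look like as data.

```python
from dataclasses import dataclass


@dataclass
class IterationResult:
    """Test outcome after one hypothetical CI iteration (illustrative only)."""
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # Guard against an empty test suite.
        return self.passed / self.total if self.total else 0.0


def pass_rate_trajectory(results):
    """Return the pass rate after each iteration, in order."""
    return [r.pass_rate for r in results]


def count_regressions(results):
    """Count iterations whose pass rate dropped relative to the previous one."""
    rates = pass_rate_trajectory(results)
    return sum(1 for prev, cur in zip(rates, rates[1:]) if cur < prev)


# Made-up history: four iterations of an agent working on one repository.
history = [
    IterationResult(passed=8, total=10),
    IterationResult(passed=9, total=10),
    IterationResult(passed=7, total=10),
    IterationResult(passed=10, total=10),
]

print(pass_rate_trajectory(history))
print(count_regressions(history))
```

A single snapshot (the final 10/10) would hide the drop at the third iteration; the trajectory makes it visible, which is the intuition behind evaluating over a sequence of changes rather than one fix.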
