TL;DR
ScarfBench is a benchmark that evaluates how well AI agents perform behavior-preserving refactoring of enterprise Java applications across Spring, Jakarta EE, and Quarkus. Current agents rarely achieve full behavioral equivalence.
Contribution
It introduces a benchmark of expert-written tasks and an evaluation framework for cross-framework migration of Java applications, a task that existing software-engineering benchmarks leave largely unmeasured.
Findings
The best AI agent achieves a test pass rate of only 15.3% on focused migrations.
Full behavioral equivalence is achieved in only one of 204 tasks.
Migration difficulty varies by framework pair and architectural layer.
Abstract
Java remains central to enterprise software, and many applications outlive their original architecture. Migrating them across frameworks is a behavior-preserving refactoring spanning build configuration, dependency injection, persistence, request handling, and deployment. Existing software-engineering benchmarks cover bug fixing, feature implementation, and language or version modernization, but leave cross-framework refactoring largely unmeasured. We introduce ScarfBench, a benchmark for behavior-preserving cross-framework refactoring of enterprise Java applications. It is built from expert-written implementation triples across Spring, Jakarta EE, and Quarkus: 34 applications (29 focused single-layer, 5 whole) yielding 102 variants (~151K lines across 1946 source and test files) and 204 directed refactoring tasks. Each task gives an agent a working source application and a target…
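The counts in the abstract can be cross-checked with a short sketch. It assumes that every application is implemented in all three frameworks and that each ordered (source, target) framework pair of an application forms one task. That reading is consistent with the reported totals, but it is an inference from the abstract, not a statement from the paper. The function name `directed_tasks` and the integer application IDs are hypothetical.

```python
from itertools import permutations

# Figures taken from the abstract.
FRAMEWORKS = ("Spring", "Jakarta EE", "Quarkus")
NUM_FOCUSED = 29  # focused single-layer applications
NUM_WHOLE = 5     # whole applications


def directed_tasks(num_apps, frameworks=FRAMEWORKS):
    """Enumerate (app_id, source, target) triples.

    Assumption: one task per ordered pair of distinct frameworks,
    per application. App IDs are placeholder integers.
    """
    return [
        (app_id, source, target)
        for app_id in range(num_apps)
        for source, target in permutations(frameworks, 2)
    ]


num_apps = NUM_FOCUSED + NUM_WHOLE        # 29 + 5 = 34
variants = num_apps * len(FRAMEWORKS)     # 34 * 3 = 102
tasks = directed_tasks(num_apps)          # 34 * 6 = 204

print(num_apps, variants, len(tasks))     # prints: 34 102 204
```

Three frameworks give six ordered pairs, so 34 applications produce 204 directed tasks, matching the abstract.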
