ScarfBench: A Benchmark for Cross-Framework Application Migration in Enterprise Java

Advait Pavuluri; Bridget McGinn; Ashita Saxena; George Safta; Srikanth Tamilselvam; Raju Pavuluri; Michele Merler; Baishakhi Ray; Rahul Krishna

arXiv:2605.06754·cs.SE·May 19, 2026

TL;DR

ScarfBench is a benchmark that evaluates whether AI agents can perform behavior-preserving cross-framework refactoring of enterprise Java applications across Spring, Jakarta EE, and Quarkus. Its results show that current agents rarely succeed.

Contribution

It introduces a benchmark of expert-written tasks and an evaluation framework for cross-framework migration of Java applications, a capability that existing benchmarks left largely unmeasured.

Findings

01

The best AI agent achieves a test pass rate of only 15.3% on focused migrations.

02

Full behavioral equivalence is achieved in only one of 204 tasks.

03

Migration difficulty varies by framework pair and architectural layer.

Abstract

Java remains central to enterprise software, and many applications outlive their original architecture. Migrating them across frameworks is a behavior-preserving refactoring spanning build configuration, dependency injection, persistence, request handling, and deployment. Existing software-engineering benchmarks cover bug fixing, feature implementation, and language or version modernization, but leave cross-framework refactoring largely unmeasured. We introduce ScarfBench, a benchmark for behavior-preserving cross-framework refactoring of enterprise Java applications. It is built from expert-written implementation triples across Spring, Jakarta EE, and Quarkus: 34 applications (29 focused single-layer, 5 whole) yielding 102 variants (~151K lines across 1946 source and test files) and 204 directed refactoring tasks. Each task gives an agent a working source application and a target…
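The counts in the abstract are consistent with a simple enumeration, sketched below. This assumes (the truncated abstract does not state it explicitly) that each of the 34 applications is implemented once per framework and that a task is one ordered source-to-target framework pair for one application. The names `FRAMEWORKS`, `NUM_APPLICATIONS`, and `enumerate_tasks` are hypothetical and are not taken from the ScarfBench repository.

```python
from itertools import permutations

# Frameworks named in the abstract.
FRAMEWORKS = ("Spring", "Jakarta EE", "Quarkus")

# 29 focused single-layer applications plus 5 whole applications.
NUM_APPLICATIONS = 29 + 5


def enumerate_tasks(num_apps, frameworks):
    """List (app_index, source, target) triples.

    Assumption: one directed task per application and per ordered
    pair of distinct frameworks.
    """
    return [
        (app, source, target)
        for app in range(num_apps)
        for source, target in permutations(frameworks, 2)
    ]


# One variant per application per framework.
variants = NUM_APPLICATIONS * len(FRAMEWORKS)
tasks = enumerate_tasks(NUM_APPLICATIONS, FRAMEWORKS)

print(variants)    # 102
print(len(tasks))  # 204
```

Under this reading, three frameworks give six ordered pairs, so each application contributes six tasks, which matches the reported 102 variants and 204 directed refactoring tasks.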

Code & Models

Repositories

https://scarfbench.info
github
