TL;DR
RepoGenesis is a benchmark for evaluating end-to-end microservice repository generation from README files. It exposes the limitations of current systems and gives future work a common platform to measure against.
Contribution
It introduces the first multilingual, repository-level microservice generation benchmark: 106 repositories (60 Python, 46 Java) with 1,258 API endpoints and 2,335 test cases, evaluated with Pass@1, API Coverage (AC), and Deployment Success Rate (DSR).
Findings
Open-source agents reach up to 73.91% API Coverage (AC) but score low on Pass@1.
Even the best systems stay below 24% Pass@1, leaving substantial room for improvement.
The fine-tuned GenesisAgent-8B performs comparably to GPT-5 mini, which is presented as evidence of the benchmark's quality.
Abstract
Large language models and agents have achieved remarkable progress in code generation. However, existing benchmarks focus on isolated function/class-level generation (e.g., ClassEval) or modifications to existing codebases (e.g., SWE-Bench), neglecting complete microservice repository generation that reflects real-world 0-to-1 development workflows. To bridge this gap, we introduce RepoGenesis, the first multilingual benchmark for repository-level end-to-end web microservice generation, comprising 106 repositories (60 Python, 46 Java) across 18 domains and 11 frameworks, with 1,258 API endpoints and 2,335 test cases verified through a "review-rebuttal" quality assurance process. We evaluate open-source agents (e.g., DeepCode) and commercial IDEs (e.g., Cursor) using Pass@1, API Coverage (AC), and Deployment Success Rate (DSR). Results reveal that despite high AC (up to 73.91%) and DSR…
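The gap between high AC/DSR and low Pass@1 is easier to see with the three metrics side by side. The sketch below is illustrative only: the excerpt above does not define the metrics, so the definitions here are assumptions (DSR as the share of repositories whose generated service deploys, AC as the share of required endpoints implemented, Pass@1 as the share of repositories where a single attempt deploys and passes every test). The RepoResult schema and the toy numbers are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class RepoResult:
    """One generation attempt for one benchmark repository (hypothetical schema)."""
    deployed: bool              # did the generated service start successfully?
    required_endpoints: int     # endpoints the specification asks for
    implemented_endpoints: int  # of those, how many the generated service exposes
    tests_total: int
    tests_passed: int


def deployment_success_rate(results):
    """Assumed DSR: share of repositories whose generated service deployed."""
    return sum(r.deployed for r in results) / len(results)


def api_coverage(results):
    """Assumed AC: share of required endpoints implemented, pooled over repositories."""
    required = sum(r.required_endpoints for r in results)
    implemented = sum(r.implemented_endpoints for r in results)
    return implemented / required


def pass_at_1(results):
    """Assumed Pass@1: share of repositories where the single attempt
    deployed and passed every test case."""
    solved = sum(r.deployed and r.tests_passed == r.tests_total for r in results)
    return solved / len(results)


# Toy data, invented for illustration only.
toy_results = [
    RepoResult(True, 10, 8, 20, 20),   # deploys, all tests pass
    RepoResult(True, 12, 9, 25, 20),   # deploys, some tests fail
    RepoResult(False, 8, 4, 15, 0),    # fails to deploy
    RepoResult(True, 10, 9, 20, 19),   # deploys, one test fails
]

if __name__ == "__main__":
    print("DSR   :", deployment_success_rate(toy_results))  # 0.75
    print("AC    :", api_coverage(toy_results))             # 0.75
    print("Pass@1:", pass_at_1(toy_results))                # 0.25
```

Under these assumed definitions, a system can expose most endpoints and deploy most services while still failing at least one test in most repositories, which is the pattern the abstract reports.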