SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents

Zhirui Zhang; Hongbo Zhang; Haoxiang Fei; Zhiyuan Bao; Yubin Chen; Zhengyu Lei; Ziyue Liu; Yixuan Sun; Mingkun Xiao; Zihang Ye; Yu Zhang; Hongcheng Zhu; Yuxiang Wen; Heung-Yeung Shum

arXiv:2602.09447 · cs.SE · February 12, 2026

Open Access

TL;DR

This paper introduces SWE-AGI, a benchmark for evaluating AI agents' ability to autonomously build complex software from specifications, highlighting current capabilities and challenges in AI-driven software engineering.

Contribution

The paper presents SWE-AGI, a novel benchmark for specification-driven software construction using MoonBit, and evaluates state-of-the-art models' performance on complex, standards-based tasks.

Findings

01 GPT-5.3-Codex achieves 86.4% task completion

02 Performance drops significantly on more difficult tasks

03 Code reading becomes the main bottleneck in large codebases

Abstract

Although large language models (LLMs) have demonstrated impressive coding capabilities, their ability to autonomously build production-scale software from explicit specifications remains an open question. We introduce SWE-AGI, an open-source benchmark for evaluating end-to-end, specification-driven construction of software systems written in MoonBit. SWE-AGI tasks require LLM-based agents to implement parsers, interpreters, binary decoders, and SAT solvers strictly from authoritative standards and RFCs under a fixed API scaffold. Each task involves implementing 1,000-10,000 lines of core logic, corresponding to weeks or months of engineering effort for an experienced human developer. By leveraging the nascent MoonBit ecosystem, SWE-AGI minimizes data leakage, forcing agents to rely on long-horizon architectural reasoning rather than code retrieval. Across frontier models, gpt-5.3-codex…

Taxonomy

Topics: Software Engineering Research · Software System Performance and Reliability · Model-Driven Software Engineering Techniques