SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

Qingnan Ren; Shun Zou; Shiting Huang; Ziao Zhang; Kou Shi; Zhen Fang; Yiming Zhao; Yu Zeng; Qisheng Su; Lin Chen; Yong Wang; Zehui Chen; Xiangxiang Chu; and Feng Zhao

arXiv:2605.17526 · cs.SE · May 19, 2026

TL;DR

SaaSBench is a comprehensive benchmark designed to evaluate AI coding agents in realistic enterprise SaaS environments, highlighting system configuration and integration as key challenges.

Contribution

It introduces SaaSBench, the first benchmark to assess AI agents on complex, heterogeneous SaaS tasks, and provides insights into the main bottlenecks faced by current models.

Findings

01

Over 95% of failures occur before reaching business logic.

02

Model overconfidence often leads to premature halts or ineffective debugging.

03

System configuration and integration are the primary bottlenecks.

Abstract

As autonomous coding agents become capable of handling increasingly long-horizon tasks, they have gradually demonstrated the potential to complete end-to-end software development. Although existing benchmarks have recently evolved from localized code editing to from-scratch project generation, they remain confined to structurally simplified, single-stack applications. Consequently, they fail to capture the heterogeneous environments, full-stack orchestration, and system-level complexity of real enterprise Software as a Service (SaaS) systems, leaving a critical gap in assessing agents under realistic engineering constraints. To fill this gap, we introduce SaaSBench, the first benchmark designed to explore the boundaries of AI agents in enterprise SaaS engineering. Spanning 30 complex tasks across 6 SaaS domains with 5,370 validation nodes, it incorporates 8 programming languages, 6…

Code & Models

Repositories

ShadeCloak/SaaSbench (GitHub)
