ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness

Wenxing Zhu; Simeng Qi; Junkui Chen; Yan Xie; Min Huang; Jingkan He; Xiao Wang; Cheng Chen; Sijing Meng; Tianqi Zhang

arXiv:2604.09564·cs.DC·April 14, 2026

TL;DR

ACE-Bench is a fast, reproducible benchmark for evaluating Azure SDK usage correctness in LLM-based coding agents without cloud provisioning.

Contribution

It introduces a lightweight, execution-free benchmark that enforces API usage patterns and semantic workflows for Azure SDKs, facilitating practical LLM evaluation.

Findings

01 The benchmark reduces evaluation cost and improves repeatability.

02 Retrieval-augmented LLMs show consistent gains from documentation access.

03 Models differ significantly in SDK usage correctness.

Abstract

We present ACE-Bench (Azure SDK Coding Evaluation Benchmark), an execution-free benchmark that provides fast, reproducible pass-or-fail signals for whether large language model (LLM)-based coding agents use Azure SDKs correctly, without provisioning cloud resources or maintaining fragile end-to-end test environments. ACE-Bench turns official Azure SDK documentation examples into self-contained coding tasks and validates solutions with task-specific atomic criteria: deterministic regex checks that enforce required API usage patterns and reference-based LLM-judge checks that capture semantic workflow constraints. This design makes SDK-centric evaluation practical in day-to-day development and CI: it reduces evaluation cost, improves repeatability, and scales to new SDKs and languages as documentation evolves. Using a lightweight coding agent, we benchmark multiple state-of-the-art LLMs and…
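To make the idea of execution-free, regex-based atomic criteria concrete, here is a minimal Python sketch of what such a check might look like. It covers only the deterministic regex half described in the abstract; the LLM-judge half is not sketched. The `RegexCriterion` class, the `evaluate` and `passed` helpers, the three criteria, and the candidate snippet are all hypothetical illustrations and are not taken from ACE-Bench itself. The candidate solution is treated purely as text and is never executed, so no Azure resources or SDK installation are needed.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RegexCriterion:
    """One atomic check: a named pattern that must appear in the solution text."""
    name: str
    pattern: str

    def check(self, solution: str) -> bool:
        # The solution is only scanned as a string; it is never imported or run.
        return re.search(self.pattern, solution, flags=re.MULTILINE) is not None


def evaluate(solution: str, criteria: list[RegexCriterion]) -> dict[str, bool]:
    """Return a per-criterion pass/fail map for one candidate solution."""
    return {c.name: c.check(solution) for c in criteria}


def passed(results: dict[str, bool]) -> bool:
    """A task passes only if every atomic criterion passes."""
    return all(results.values())


# Hypothetical criteria for an illustrative "upload a blob" task.
CRITERIA = [
    RegexCriterion("creates_blob_service_client", r"\bBlobServiceClient\s*\("),
    RegexCriterion("uses_default_credential", r"\bDefaultAzureCredential\s*\("),
    RegexCriterion("calls_upload_blob", r"\.upload_blob\s*\("),
]

# Hypothetical agent output, stored as plain text.
CANDIDATE = '''
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(account_url, credential=DefaultAzureCredential())
container = service.get_container_client("docs")
container.upload_blob(name="a.txt", data=b"hello")
'''

if __name__ == "__main__":
    results = evaluate(CANDIDATE, CRITERIA)
    for name, ok in results.items():
        print(f"{name}: {'PASS' if ok else 'FAIL'}")
    print("task passed:", passed(results))
```

Because each criterion is a pure function of the solution text, a check like this is deterministic and cheap enough to run in CI. Its limitation is that a pattern match shows an API was mentioned, not that it was used in the right order or with the right arguments, which is presumably why the abstract pairs regex checks with reference-based LLM-judge checks for workflow semantics.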
