ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness
Wenxing Zhu, Simeng Qi, Junkui Chen, Yan Xie, Min Huang, Jingkan He, Xiao Wang, Cheng Chen, Sijing Meng, Tianqi Zhang

TL;DR
ACE-Bench is a fast, reproducible benchmark for evaluating Azure SDK usage correctness in LLM-based coding agents without cloud provisioning.
Contribution
It introduces a lightweight, execution-free benchmark that enforces API usage patterns and semantic workflows for Azure SDKs, facilitating practical LLM evaluation.
Findings
The benchmark reduces evaluation cost and improves repeatability.
Retrieval-augmented LLMs show consistent gains from documentation access.
Models differ significantly in SDK usage correctness.
Abstract
We present ACE-Bench (Azure SDK Coding Evaluation Benchmark), an execution-free benchmark that provides fast, reproducible pass or fail signals for whether large language model (LLM)-based coding agents use Azure SDKs correctly, without provisioning cloud resources or maintaining fragile end-to-end test environments. ACE-Bench turns official Azure SDK documentation examples into self-contained coding tasks and validates solutions with task-specific atomic criteria: deterministic regex checks that enforce required API usage patterns and reference-based LLM-judge checks that capture semantic workflow constraints. This design makes SDK-centric evaluation practical in day-to-day development and CI: it reduces evaluation cost, improves repeatability, and scales to new SDKs and languages as documentation evolves. Using a lightweight coding agent, we benchmark multiple state-of-the-art LLMs and…
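The deterministic half of the atomic criteria described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `RegexCriterion` type, the `evaluate` and `passed` helpers, the criterion names, and the regex patterns are all hypothetical, chosen for an assumed task of uploading a blob with the Azure Storage SDK for Python. The candidate solutions are plain strings that are only pattern-matched, never executed, which is the sense in which such a check is execution-free. The reference-based LLM-judge checks are not sketched here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RegexCriterion:
    """One atomic check (hypothetical type): a name plus a required pattern."""
    name: str
    pattern: str


def evaluate(solution: str, criteria: list[RegexCriterion]) -> dict[str, bool]:
    """Return a per-criterion result; the solution text is never executed."""
    return {
        c.name: re.search(c.pattern, solution, re.MULTILINE) is not None
        for c in criteria
    }


def passed(results: dict[str, bool]) -> bool:
    """A task passes only if every atomic criterion is satisfied."""
    return all(results.values())


# Hypothetical criteria for an assumed "upload a blob" task.
CRITERIA = [
    RegexCriterion(
        "imports_blob_service_client",
        r"^\s*from\s+azure\.storage\.blob\s+import\s+.*\bBlobServiceClient\b",
    ),
    RegexCriterion("uses_default_credential", r"\bDefaultAzureCredential\s*\("),
    RegexCriterion("calls_upload_blob", r"\.upload_blob\s*\("),
]

# Candidate solutions as an agent might produce them (strings only).
GOOD_SOLUTION = '''
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(account_url=url, credential=DefaultAzureCredential())
blob = service.get_blob_client(container="docs", blob="report.txt")
blob.upload_blob(data, overwrite=True)
'''

BAD_SOLUTION = '''
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(conn_str)
blob = service.get_blob_client(container="docs", blob="report.txt")
blob.upload_blob(data, overwrite=True)
'''

if __name__ == "__main__":
    for label, text in [("good", GOOD_SOLUTION), ("bad", BAD_SOLUTION)]:
        results = evaluate(text, CRITERIA)
        print(label, "PASS" if passed(results) else "FAIL", results)
```

Because each criterion is reported separately, a failing run shows which required usage pattern is missing rather than a single opaque verdict. Regex checks of this kind are cheap and repeatable but cannot confirm that calls occur in a sensible order, which is presumably why the benchmark pairs them with LLM-judge checks for workflow constraints.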
