Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation

Hubert M. Pysklo; Artem Zhuravel; Patrick D. Watson

arXiv:2602.11224 · cs.SE · April 29, 2026

TL;DR

Agent-Diff introduces a benchmarking framework for evaluating large language models on enterprise API tasks by combining real API access with sandboxed environments, using a novel state-diff contract for precise success measurement.

Contribution

The paper presents a benchmarking approach that balances real-world API interaction with controlled evaluation, using state-diff contracts and containerized sandboxing.

Findings

01. Benchmarked nine LLMs on 224 enterprise software tasks.

02. Demonstrated the framework's robustness with ablation experiments.

03. Showed that API documentation access influences model performance.

Abstract

We present Agent-Diff, a novel benchmarking framework for evaluating agentic Large Language Models (LLMs) on real-world productivity software API tasks via code execution. Agentic LLM performance varies due to differences in models, external tool access, prompt structures, and agentic frameworks. Benchmarks must make fundamental trade-offs between a sandboxed approach that controls for variation in software environments and more ecologically valid approaches employing real services. Agent-Diff attempts to capture the desirable features of both of these approaches by including access to the real API interfaces for software services while sandboxing the environment in which calls are made, processed, and evaluated. This approach relies on two key innovations. The first is a novel state-diff contract, which separates process from outcome - rather than fuzzy trace or parameter matching, we…
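
The abstract is cut off before it describes the state-diff contract in detail, so the following is only a minimal sketch of the general idea of outcome-based checking: snapshot the environment state before and after the agent runs, compute the difference, and compare that difference with the changes the task expects. The snapshot layout, the function names `state_diff` and `satisfies`, and the Slack-like example records are all hypothetical and are not taken from the paper or its repository.

```python
from typing import Any, Dict

# Hypothetical snapshot layout: record id -> record fields.
State = Dict[str, Dict[str, Any]]


def state_diff(before: State, after: State) -> Dict[str, Dict[str, Any]]:
    """Return the records inserted, deleted, and updated between two snapshots."""
    inserted = {k: after[k] for k in after.keys() - before.keys()}
    deleted = {k: before[k] for k in before.keys() - after.keys()}
    updated = {}
    for k in before.keys() & after.keys():
        # Keep only fields that are new or whose value changed.
        # Fields removed from a record are ignored in this sketch.
        changed = {f: v for f, v in after[k].items() if before[k].get(f) != v}
        if changed:
            updated[k] = changed
    return {"inserted": inserted, "deleted": deleted, "updated": updated}


def satisfies(diff: Dict[str, Dict[str, Any]], expected: Dict[str, Dict[str, Any]]) -> bool:
    """Outcome check: the observed diff must equal the expected diff exactly."""
    return diff == expected


# Illustrative task: edit one message and post a new one.
before = {"msg-1": {"channel": "general", "text": "hi"}}
after = {
    "msg-1": {"channel": "general", "text": "hello"},
    "msg-2": {"channel": "general", "text": "done"},
}
expected = {
    "inserted": {"msg-2": {"channel": "general", "text": "done"}},
    "deleted": {},
    "updated": {"msg-1": {"text": "hello"}},
}

diff = state_diff(before, after)
print(satisfies(diff, expected))
```

Because the check looks only at the resulting state, two agents that reach the same end state through different sequences of API calls would be scored identically under this sketch, which is the sense in which process is separated from outcome.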

Code & Models

Repositories

agent-diff-bench/agent-diff (GitHub)