ContractBench: Can LLM Agents Preserve Observation Contracts?

Jicheng Wang; Yifeng He; Zili Wang; Hanwen Xing; Arkaprava De; Hao Chen

arXiv:2605.17281·cs.SE·May 19, 2026

TL;DR

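ContractBench is a benchmark that measures whether LLM agents preserve observation contracts, meaning the temporal validity and byte-level integrity of artifacts such as presigned URLs and session tokens. No evaluated model exceeds 80% compliance, and compliance does not improve consistently with larger or newer models.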

Contribution

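This work presents ContractBench, a benchmark of 33 dual-axis tasks that measures observation contract compliance in LLM agents along two failure modes: validity failures (using an artifact after expiry) and integrity failures (corrupting an artifact's bytes). It shows that compliance is an emergent, regression-prone capability.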

Findings

01

No model exceeds 80% compliance, with Claude-Opus-4.6 at 77.8%.

02

A sharp capability cliff exists in Qwen 3.5 between 4B and 9B models.

03

Scaling in the GPT-5 family is non-monotonic: compliance regresses rather than improving steadily with scale.

Abstract

Tool-augmented LLM agents call APIs whose intermediate outputs, such as presigned URLs, session tokens, and OAuth state parameters, are observation contracts: artifacts whose later use is constrained by the external system that produced them. We show that observation contract compliance (preserving the temporal validity and byte-level integrity) is an emergent, regression-prone capability: it is neither guaranteed by general tool-use ability nor consistently improved by larger or newer models. To measure this, we introduce ContractBench, a benchmark of 33 dual-axis tasks that probe two orthogonal failure modes no existing benchmark evaluates: validity failures (using an artifact after expiry) and integrity failures (corrupting an artifact's bytes through the observation-to-action pipeline). Our evaluation is deterministic and programmatic, with a virtual clock controlling time and…
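The two failure modes named in the abstract can be illustrated with a minimal sketch. This is not ContractBench's harness: `VirtualClock`, `Artifact`, `issue`, `check_use`, the example URL, and the 60-second lifetime are all hypothetical names and values chosen for illustration. The sketch assumes an artifact is valid strictly before its expiry time and that integrity means an exact byte-for-byte match with what the external system issued.

```python
from dataclasses import dataclass


class VirtualClock:
    """Deterministic clock: time moves only when advance() is called."""

    def __init__(self, start: float = 0.0) -> None:
        self.now = start

    def advance(self, seconds: float) -> None:
        self.now += seconds


@dataclass(frozen=True)
class Artifact:
    """Hypothetical issued artifact, e.g. a presigned URL or session token."""
    value: bytes        # exact bytes the external system produced
    expires_at: float   # virtual time at which the artifact stops being valid


def issue(clock: VirtualClock, value: bytes, ttl: float) -> Artifact:
    """Issue an artifact that expires ttl seconds after the current virtual time."""
    return Artifact(value=value, expires_at=clock.now + ttl)


def check_use(clock: VirtualClock, artifact: Artifact, presented: bytes) -> str:
    """Classify one use of an artifact (validity is checked before integrity)."""
    if clock.now >= artifact.expires_at:
        return "validity_failure"   # used at or after expiry
    if presented != artifact.value:
        return "integrity_failure"  # bytes changed between observation and action
    return "ok"


clock = VirtualClock()
url = issue(clock, b"https://example.test/obj?sig=aB3%2Fx9", ttl=60.0)

results = []
# Faithful reuse inside the validity window.
results.append(check_use(clock, url, url.value))
# The agent "helpfully" URL-decodes %2F, which changes the signed bytes.
results.append(check_use(clock, url, b"https://example.test/obj?sig=aB3/x9"))
# The agent waits too long before using the otherwise intact artifact.
clock.advance(61.0)
results.append(check_use(clock, url, url.value))

print(results)  # ['ok', 'integrity_failure', 'validity_failure']
```

Because the clock advances only on explicit calls, the same sequence of agent actions always yields the same classification, which is the property a deterministic, programmatic evaluation relies on.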
