ContractBench: Can LLM Agents Preserve Observation Contracts?
Jicheng Wang, Yifeng He, Zili Wang, Hanwen Xing, Arkaprava De, Hao Chen

TL;DR
ContractBench is a benchmark that evaluates how well LLM agents preserve observation contracts. It shows that current models fall short and that compliance does not improve consistently with larger or newer models.
Contribution
This work presents ContractBench, a benchmark of 33 dual-axis tasks that measures observation contract compliance in LLM agents, and shows that this capability is emergent and regression-prone across model scales.
Findings
No model exceeds 80% compliance, with Claude-Opus-4.6 at 77.8%.
A sharp capability cliff exists in Qwen 3.5 between 4B and 9B models.
Scaling in the GPT-5 family is non-monotonic, with regressions in compliance.
Abstract
Tool-augmented LLM agents call APIs whose intermediate outputs, such as presigned URLs, session tokens, and OAuth state parameters, are observation contracts: artifacts whose later use is constrained by the external system that produced them. We show that observation contract compliance (preserving the temporal validity and byte-level integrity) is an emergent, regression-prone capability: it is neither guaranteed by general tool-use ability nor consistently improved by larger or newer models. To measure this, we introduce ContractBench, a benchmark of 33 dual-axis tasks that probe two orthogonal failure modes no existing benchmark evaluates: validity failures (using an artifact after expiry) and integrity failures (corrupting an artifact's bytes through the observation-to-action pipeline). Our evaluation is deterministic and programmatic, with a virtual clock controlling time and…
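The two failure modes named in the abstract can be illustrated with a small sketch. It is not the ContractBench harness: the names VirtualClock, Artifact, and check_use, the example URL, and the 60-second expiry are all hypothetical, chosen only to show how a deterministic clock and a byte-level comparison could score validity and integrity independently.

```python
from dataclasses import dataclass


class VirtualClock:
    """Hypothetical deterministic clock: time moves only when advanced."""

    def __init__(self, start: float = 0.0) -> None:
        self.now = start

    def advance(self, seconds: float) -> None:
        self.now += seconds


@dataclass(frozen=True)
class Artifact:
    """An observed value (e.g. a presigned URL) and its assumed expiry time."""
    value: str
    expires_at: float


def check_use(artifact: Artifact, used_value: str, clock: VirtualClock) -> dict:
    """Score one use of an artifact on the two axes independently."""
    return {
        # Validity: the artifact must be used before it expires.
        "validity_ok": clock.now <= artifact.expires_at,
        # Integrity: the bytes the agent sends must match the bytes it observed.
        "integrity_ok": used_value.encode("utf-8") == artifact.value.encode("utf-8"),
    }


clock = VirtualClock()
url = Artifact("https://example.invalid/obj?sig=aB3%2Fx9", expires_at=60.0)

# Used immediately and unchanged: passes both axes.
on_time_exact = check_use(url, url.value, clock)

# Used after expiry, with "%2F" silently decoded to "/": fails both axes.
clock.advance(120.0)
late_and_decoded = check_use(url, "https://example.invalid/obj?sig=aB3/x9", clock)
```

Because the clock advances only when the harness says so, the same sequence of calls always yields the same verdicts, which is what makes this kind of check deterministic.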
