TERMS-Bench: Diagnosing LLM Negotiation Agents Beyond Deal Rate
Erica Zhang, Fangzhao Zhang, Aneesh Pappu, Batu El, Jose Blanchet, Susan Athey, Jiashuo Liu, James Zou

TL;DR
Terms-Bench is a Bayesian-game framework for diagnosing large language model negotiation agents, enabling failure analysis that goes beyond aggregate deal rate.
Contribution
It makes the environment itself the verifier by specifying the counterpart's latent type, policy, and payoff structure, so that agent failures and strengths can be diagnosed rather than inferred from aggregate outcomes.
Findings
Frontier models saturate deal rate but differ in surplus extraction and cue use.
Agents diverge in belief calibration and compliance.
Prior benchmarks masked agent-specific bargaining bottlenecks.
Abstract
Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation. It is also a canonical testbed for agentic language models, requiring multi-turn interaction under hidden preferences, strategic communication, and binding constraints. These properties make negotiation hard to evaluate: unlike math or code, it has no intrinsic verifier. Existing LLM negotiation evaluations rely on LLM-vs.-LLM interaction or aggregate outcomes such as deal rate, leaving failures opaque. We introduce Terms-Bench, short for Testbed for Economic Reasoning in Multi-turn Strategy, a Bayesian-game framework that makes the environment itself the verifier by specifying the counterpart's latent type, policy, and payoff structure. We instantiate it in bilateral price negotiation, where the counterpart's private state and simulator policy are hidden…
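To make the idea of an environment-as-verifier concrete, here is a toy sketch in Python. It is not the paper's simulator: the `SellerType` fields, the geometric concession policy, and the surplus-share metric are all assumptions introduced for illustration. It shows how, once the counterpart's hidden reservation price and policy are fixed by the environment, two buyers who both close a deal can be told apart by how much surplus they capture.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SellerType:
    """Hypothetical latent type, hidden from the buyer agent."""
    reservation_price: float  # lowest price the seller will accept
    concession_rate: float    # fraction of the remaining gap conceded each round


def seller_ask(seller: SellerType, list_price: float, round_idx: int) -> float:
    """Assumed policy: the ask decays geometrically from list price toward the reservation price."""
    gap = list_price - seller.reservation_price
    return seller.reservation_price + gap * (1 - seller.concession_rate) ** round_idx


def negotiate(seller: SellerType, list_price: float,
              buyer_offers: Sequence[float]) -> Optional[float]:
    """Return the first buyer offer that meets the seller's current ask, or None if no deal."""
    for round_idx, offer in enumerate(buyer_offers):
        if offer >= seller_ask(seller, list_price, round_idx):
            return offer
    return None


def buyer_surplus_share(price: float, buyer_value: float, seller: SellerType) -> float:
    """Share of the total surplus (buyer value minus seller reservation) kept by the buyer."""
    total_surplus = buyer_value - seller.reservation_price
    return (buyer_value - price) / total_surplus


seller = SellerType(reservation_price=60.0, concession_rate=0.5)
eager_price = negotiate(seller, 100.0, [90.0, 90.0, 90.0])
patient_price = negotiate(seller, 100.0, [62.0, 64.0, 66.0, 68.0])

# Both buyers reach a deal, so deal rate alone cannot separate them.
print(eager_price, buyer_surplus_share(eager_price, 100.0, seller))
print(patient_price, buyer_surplus_share(patient_price, 100.0, seller))
```

Because the environment knows the seller's reservation price, the surplus share can be computed exactly instead of being judged by another model; the same hidden state could in principle be used to score how well an agent's stated beliefs match the truth.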
