TERMS-Bench: Diagnosing LLM Negotiation Agents Beyond Deal Rate

Erica Zhang; Fangzhao Zhang; Aneesh Pappu; Batu El; Jose Blanchet; Susan Athey; Jiashuo Liu; James Zou

arXiv:2605.13909·cs.GT·May 15, 2026

TL;DR

TERMS-Bench is a Bayesian-game framework for diagnosing large language model (LLM) negotiation agents, enabling detailed failure analysis beyond aggregate metrics such as deal rate.

Contribution

TERMS-Bench makes the environment itself the verifier by specifying the counterpart's latent type, policy, and payoff structure, so that agent failures and strengths can be diagnosed rather than inferred from aggregate outcomes.

Findings

01. Frontier models saturate deal rate but differ in surplus extraction and cue use.

02. Agents show divergence in belief calibration and compliance.

03. Prior benchmarks masked agent-specific bargaining bottlenecks.

Abstract

Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation. It is also a canonical testbed for agentic language models, requiring multi-turn interaction under hidden preferences, strategic communication, and binding constraints. These properties make negotiation hard to evaluate: unlike math or code, it has no intrinsic verifier. Existing LLM negotiation evaluations rely on LLM-vs.-LLM interaction or aggregate outcomes such as deal rate, leaving failures opaque. We introduce TERMS-Bench, short for Testbed for Economic Reasoning in Multi-turn Strategy, a Bayesian-game framework that makes the environment itself the verifier by specifying the counterpart's latent type, policy, and payoff structure. We instantiate it in bilateral price negotiation, where the counterpart's private state and simulator policy are hidden…
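To illustrate how an environment that fixes the counterpart's latent type and policy can act as a verifier, here is a minimal toy sketch in Python. It is not the paper's simulator: the `SellerType` fields, the geometric concession rule, the two buyer policies, and all numeric ranges are hypothetical choices made only for this example. Because the environment knows the seller's private reservation price, it can score each deal by the share of total surplus the buyer captured, which deal rate alone cannot show.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

BUYER_VALUE = 100.0  # hypothetical: what the item is worth to the buyer


@dataclass(frozen=True)
class SellerType:
    """Hypothetical latent type, hidden from the buyer agent."""
    reservation: float  # lowest price the seller would accept
    opening_ask: float  # ask in round 0
    concession: float   # fraction of the remaining gap conceded per round


def seller_ask(seller: SellerType, round_idx: int) -> float:
    """Toy seller policy: the ask decays geometrically toward the reservation."""
    gap = seller.opening_ask - seller.reservation
    return seller.reservation + gap * (1.0 - seller.concession) ** round_idx


BuyerPolicy = Callable[[int, float, float], float]


def eager_buyer(round_idx: int, ask: float, value: float) -> float:
    """Pays the current ask as soon as it is affordable."""
    return min(ask, value)


def patient_buyer(round_idx: int, ask: float, value: float) -> float:
    """Starts low and raises its offer by a fixed step each round."""
    return min(value, 50.0 + 5.0 * round_idx)


def run_episode(seller: SellerType, policy: BuyerPolicy,
                max_rounds: int = 10) -> Optional[float]:
    """Return the deal price, or None if no agreement is reached."""
    for t in range(max_rounds):
        ask = seller_ask(seller, t)
        offer = policy(t, ask, BUYER_VALUE)
        if offer >= ask:  # seller accepts any offer at or above its ask
            return offer
    return None


def evaluate(policy: BuyerPolicy, n_episodes: int = 200,
             seed: int = 0) -> Tuple[float, float]:
    """Return (deal rate, mean buyer share of total surplus)."""
    rng = random.Random(seed)
    deals, shares = 0, []
    for _ in range(n_episodes):
        r = rng.uniform(40.0, 60.0)  # sample the hidden type
        seller = SellerType(reservation=r, opening_ask=r + 30.0, concession=0.3)
        price = run_episode(seller, policy)
        if price is None:
            shares.append(0.0)  # no deal means no surplus captured
        else:
            deals += 1
            shares.append((BUYER_VALUE - price) / (BUYER_VALUE - r))
    return deals / n_episodes, sum(shares) / n_episodes


for name, policy in [("eager", eager_buyer), ("patient", patient_buyer)]:
    rate, share = evaluate(policy)
    print(f"{name}: deal_rate={rate:.2f}, buyer_surplus_share={share:.2f}")
```

In this toy setup both buyers close every negotiation, so deal rate cannot separate them, while the surplus share computed from the hidden reservation price does. That mirrors, in a highly simplified way, the gap between deal rate and surplus extraction described in the findings.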
