MarketBench: Evaluating AI Agents as Market Participants

Andrey Fradkin; Rohit Krishnan

arXiv:2604.23897 · cs.AI · April 28, 2026


TL;DR

MarketBench is a benchmark that evaluates whether AI agents can participate effectively in markets, focusing on how accurately they assess their own probability of success and cost. It is demonstrated with six LLMs on a 93-task subset of SWE-bench Lite, a software engineering benchmark.

Contribution

We introduce MarketBench, a novel benchmark for assessing AI agents' market participation capabilities, highlighting calibration issues and the impact of self-assessment on market coordination.

Findings

01. LLMs are miscalibrated on success probability and token usage.

02. Adding prior capability information modestly improves calibration.

03. Self-assessment is a key bottleneck for market-based AI coordination.
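
To make "miscalibrated" concrete, the sketch below scores a set of self-reports against realized outcomes. It is an illustration only: the metrics chosen here (Brier score, mean overconfidence, and an actual-to-predicted token ratio), the function names, and the sample numbers are all assumptions and are not taken from the paper, which may measure calibration differently.

```python
from statistics import mean


def brier_score(predicted, outcomes):
    """Mean squared error between stated success probabilities and 0/1 outcomes.

    Lower is better; 0.25 is what a constant guess of 0.5 would score.
    """
    if len(predicted) != len(outcomes):
        raise ValueError("predicted and outcomes must have the same length")
    return mean((p - o) ** 2 for p, o in zip(predicted, outcomes))


def overconfidence(predicted, outcomes):
    """Mean stated probability minus realized success rate.

    Positive values mean the agent claims more success than it delivers.
    """
    return mean(predicted) - mean(outcomes)


def token_ratio(predicted_tokens, actual_tokens):
    """Mean of actual / predicted token usage.

    Values above 1 mean the agent underestimated its own cost.
    """
    return mean(a / p for p, a in zip(predicted_tokens, actual_tokens))


# Hypothetical self-reports for four tasks (made-up numbers, not paper data).
stated = [0.9, 0.8, 0.9, 0.7]
solved = [1, 0, 0, 1]

print("Brier score:", brier_score(stated, solved))
print("Overconfidence:", overconfidence(stated, solved))
print("Token ratio:", token_ratio([1000, 2000], [1500, 2000]))
```

A well-calibrated agent would show an overconfidence value near zero and a token ratio near one; large deviations in either direction make its bids less informative to a market.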

Abstract

Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly. In order to effectively participate in markets, agents need to have informative signals of their own ability to successfully complete a task and the cost of doing so. We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities. We use a 93-task subset of SWE-bench Lite, a software engineering benchmark, with six recently released LLMs as a demonstration. These LLMs are miscalibrated on both success probability and token usage, and auctions built from these self-reports diverge from a full-information allocation. A follow-up intervention where we add information about capabilities from prior experiments to the context improves calibration, but only modestly narrows the gap to a full-information benchmark. We also document the…
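
One way to see why miscalibrated self-reports matter for allocation is a toy comparison between an allocation built from reported values and one built from true values. The sketch below is a simplified stand-in, not the auction used in the paper: the `Bid` type, the highest-expected-surplus rule, and every number are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    agent: str
    p_success: float  # probability of solving the task
    cost: float       # expected cost of attempting, in the same units as task value


def expected_surplus(bid, task_value):
    """Expected value of assigning the task to this agent, net of its cost."""
    return bid.p_success * task_value - bid.cost


def pick_winner(bids, task_value):
    """Return the agent with the highest expected surplus, or None if none is positive."""
    best = max(bids, key=lambda b: expected_surplus(b, task_value))
    return best.agent if expected_surplus(best, task_value) > 0 else None


def allocation_gap(reported, truth, task_value):
    """True expected surplus lost by allocating on self-reports instead of true values."""
    truth_by_agent = {b.agent: b for b in truth}

    def true_value(agent):
        if agent is None:
            return 0.0  # leaving the task unassigned yields nothing
        return expected_surplus(truth_by_agent[agent], task_value)

    full_info = true_value(pick_winner(truth, task_value))
    self_report = true_value(pick_winner(reported, task_value))
    return full_info - self_report


# Hypothetical example: agent A overstates its ability and understates its cost.
reported = [Bid("A", 0.9, 2.0), Bid("B", 0.6, 1.0)]
truth = [Bid("A", 0.3, 4.0), Bid("B", 0.6, 1.0)]

print("Winner on self-reports:", pick_winner(reported, 10.0))
print("Winner with full information:", pick_winner(truth, 10.0))
print("Surplus lost:", allocation_gap(reported, truth, 10.0))
```

In this toy case the overconfident agent wins the task on its self-report even though the other agent is the better choice, which is the kind of divergence from a full-information allocation that the abstract describes.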
