LLM Agents Do Not Replicate Human Market Traders: Evidence From Experimental Finance

Thomas Henning; Siddhartha M. Ojha; Ross Spoon; Jiatong Han; Colin F. Camerer

arXiv:2502.15800·q-fin.TR·October 14, 2025

Open Access · 3 Reviews

TL;DR

This study tests whether Large Language Models can replicate human market behaviors in experimental finance settings, finding that LLMs tend to act rationally and do not produce typical human-like bubbles or crashes.

Contribution

It provides empirical evidence that LLMs do not naturally exhibit complex human market behaviors, challenging their use as models for human financial decision-making.

Findings

01 LLMs price assets near fundamental value

02 LLMs show less trading strategy variance than humans

03 LLMs do not produce large emergent bubbles

Abstract

This paper explores how Large Language Models (LLMs) behave in a classic experimental finance paradigm widely known for eliciting bubbles and crashes in human participants. We adapt an established trading design, where traders buy and sell a risky asset with a known fundamental value, and introduce several LLM-based agents, both in single-model markets (all traders are instances of the same LLM) and in mixed-model "battle royale" settings (multiple LLMs competing in the same market). Our findings reveal that LLMs generally exhibit a "textbook-rational" approach, pricing the asset near its fundamental value, and show only a muted tendency toward bubble formation. Further analyses indicate that LLM-based agents display less trading strategy variance in contrast to humans. Taken together, these results highlight the risk of relying on LLM-only data to replicate human-driven market…
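As a rough illustration of what "pricing the asset near its fundamental value" means quantitatively, the sketch below computes two mispricing measures that are common in the experimental asset market literature: relative absolute deviation (RAD) and relative deviation (RD). The excerpt above does not say which measures the authors use, so this choice is an assumption, and the fundamental value of 14 and both price paths are invented for illustration only.

```python
from statistics import mean


def _check(prices, fundamentals):
    # Both series must cover the same, non-empty set of periods.
    if not prices or len(prices) != len(fundamentals):
        raise ValueError("need two non-empty series of equal length")


def relative_absolute_deviation(prices, fundamentals):
    """Mean absolute gap between price and fundamental value,
    scaled by the mean fundamental value. Zero means no mispricing."""
    _check(prices, fundamentals)
    gaps = [abs(p - f) for p, f in zip(prices, fundamentals)]
    return mean(gaps) / abs(mean(fundamentals))


def relative_deviation(prices, fundamentals):
    """Mean signed gap between price and fundamental value,
    scaled by the mean fundamental value. Positive means overvaluation."""
    _check(prices, fundamentals)
    gaps = [p - f for p, f in zip(prices, fundamentals)]
    return mean(gaps) / abs(mean(fundamentals))


# Hypothetical data: a constant fundamental value over five periods.
fundamentals = [14.0] * 5
near_fundamental = [14.0, 15.0, 13.0, 14.0, 14.0]  # small, offsetting errors
bubble_and_crash = [14.0, 21.0, 28.0, 21.0, 7.0]   # run-up, then collapse

for label, path in [("near-fundamental", near_fundamental),
                    ("bubble-and-crash", bubble_and_crash)]:
    rad = relative_absolute_deviation(path, fundamentals)
    rd = relative_deviation(path, fundamentals)
    print(f"{label}: RAD={rad:.3f} RD={rd:.3f}")
```

On these made-up paths, the first series yields a RAD close to zero while the second yields a much larger one, which is the kind of gap the abstract describes between LLM and human markets.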

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The experimental design allows for a sensible comparison of human versus LLM behavior. Even though the LLMs tested are relatively outdated at this point, the fact that a broad array of different LLMs from different providers are tested, and the separation between human and LLM behavior is so clear, means the results strike me as credible and generalizable.

Weaknesses

The textual trading strategy analysis is interesting but perhaps a little rudimentary, focusing mostly on keyword matching. Moreover, the results at the start of Section 6.1 (that the LLMs and humans all write in different styles) are not really surprising. This analysis would be strengthened by, e.g., (1) a more fine-grained semantic text analysis, (2) additional experiments in the style of Section 7, e.g. checking how LLMs' trading behavior changes if the content of the insight/plan part of the

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. First systematic comparison of LLMs and humans in endogenous experimental markets, bridging behavioral finance and AI alignment.
2. Rigorous methodology: a lot of markets vs. 6 LLM models (Claude-3.5, GPT-4o, etc.), with controls for dividend shocks and experience.
3. Well-structured with clear visualizations and statistical tests.
4. Challenges the use of off-the-shelf LLMs as human proxies in finance experiments.

Weaknesses

Are larger models (e.g., GPT-4 Turbo, Claude 3 Opus) and LLMs that may exhibit different behaviors excluded? Why LLMs are anchored to fundamentals is not explored. The simplified single-asset design lacks real-world characteristics (e.g., short selling, information asymmetry). The incentives of human participants (e.g., monetary rewards) may not align with the "profit maximization" imperative of LLMs. The impact of imperative engineering (e.g., explicit bubble-inducing directives) is not exp

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The paper covers several topics; there is a comparison to human baselines and careful analysis.

Weaknesses

I'm a bit unsold on the motivation and contributions.

- First of all, the exact types of experimental data and 'correct'/'expected' behavior are well within the training data of the LLM. In addition, prompts in the experiment disclose redemption value / fundamental value mechanisms, so it is largely unsurprising to observe the discovered LLM phenomenon. In other words, "textbook-rational" seems unsurprising given that LLMs are trained on the said textbooks, as well as the CoT reasoning paradigm t

Taxonomy

Topics: Corporate Finance and Governance · Complex Systems and Time Series Analysis · Private Equity and Venture Capital