Toward Reliable Evaluation of LLM-Based Financial Multi-Agent Systems: Taxonomy, Coordination Primacy, and Cost Awareness
Phat Nguyen, Thang Pham

TL;DR
This survey proposes a taxonomy and evaluation standards for LLM-based financial multi-agent systems, arguing that coordination protocol design is a primary driver of performance and flagging common evaluation pitfalls.
Contribution
It introduces a four-dimensional taxonomy, formulates the Coordination Primacy Hypothesis (CPH), and develops metrics and standards to make evaluation claims in the field more credible.
Findings
Coordination protocol design is hypothesized to be a primary driver of trading decision quality, often exerting greater influence than model scaling.
Common evaluation biases can reverse reported returns.
The Coordination Breakeven Spread helps assess genuine multi-agent coordination value.
Abstract
Multi-agent systems based on large language models (LLMs) for financial trading have grown rapidly since 2023, yet the field lacks a shared framework for understanding what drives performance or for evaluating claims credibly. This survey makes three contributions. First, we introduce a four-dimensional taxonomy covering architecture pattern, coordination mechanism, memory architecture, and tool integration, and apply it to 12 multi-agent systems and two single-agent baselines. Second, we formulate the Coordination Primacy Hypothesis (CPH): inter-agent coordination protocol design is a primary driver of trading decision quality, often exerting greater influence than model scaling. CPH is presented as a falsifiable research hypothesis supported by tiered structural evidence rather than as an empirically validated conclusion; its definitive validation requires evaluation infrastructure that…
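As a rough illustration of how the four taxonomy dimensions named in the abstract could be recorded per system, the sketch below stores one profile per system and groups systems by a chosen dimension. Only the four dimension names come from the abstract. The class `SystemProfile`, the helper `group_by`, the system names, and every category label are hypothetical placeholders, not the paper's classification.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    """One surveyed system, described along the four taxonomy dimensions."""
    name: str
    architecture_pattern: str
    coordination_mechanism: str
    memory_architecture: str
    tool_integration: str


def group_by(profiles, dimension):
    """Group system names by their value on one taxonomy dimension."""
    groups = defaultdict(list)
    for p in profiles:
        groups[getattr(p, dimension)].append(p.name)
    return dict(groups)


# Placeholder entries: names and category labels are invented for illustration.
profiles = [
    SystemProfile("system_a", "hierarchical", "debate", "shared", "market_data_api"),
    SystemProfile("system_b", "flat", "voting", "per_agent", "none"),
    SystemProfile("system_c", "hierarchical", "voting", "shared", "market_data_api"),
]

by_coordination = group_by(profiles, "coordination_mechanism")
print(by_coordination)
```

Grouping by `coordination_mechanism` is the cut most relevant to CPH, since it lets systems that share a base model but differ in coordination protocol be compared side by side.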
