TL;DR
This paper argues that reported alpha from end-to-end LLM trading agents should not be considered reliable evidence of deployable trading capability without rigorous validation, due to structural and evaluative issues.
Contribution
It introduces a minimum reporting protocol suite (P1--P6) and a modular alternative for more reliable evaluation of LLM trading agents.
Findings
Current public evidence cannot distinguish robust predictive ability from contamination.
Reported Sharpe ratios may be inflated by unmodeled frictions and short-term biases.
The paper proposes a structured validation protocol to make these assessments more reliable.
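To illustrate the friction point in the findings above, here is a minimal sketch of how a proportional transaction cost can shrink a reported Sharpe ratio. It is not the paper's protocol. The helper names (`annualized_sharpe`, `net_returns`), the zero risk-free rate, the 252-period annualization, the 10 bps cost, and the toy return series are all assumptions chosen for illustration.

```python
import math
from statistics import mean, stdev


def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns.

    Assumption: the risk-free rate is zero, so returns are already excess returns.
    """
    sd = stdev(returns)  # sample standard deviation
    if sd == 0:
        raise ValueError("Sharpe ratio is undefined for zero-variance returns")
    return mean(returns) / sd * math.sqrt(periods_per_year)


def net_returns(gross_returns, positions, cost_per_unit_turnover):
    """Subtract a proportional cost charged on every change in position.

    Assumption: the book starts flat (position 0) and cost scales linearly
    with turnover; real frictions (spread, impact, borrow) are richer.
    """
    net = []
    prev = 0.0
    for r, pos in zip(gross_returns, positions):
        turnover = abs(pos - prev)
        net.append(r - cost_per_unit_turnover * turnover)
        prev = pos
    return net


# Toy, made-up daily returns and positions (+1 long, -1 short); not from any paper.
gross = [0.004, -0.002, 0.003, 0.001, -0.001, 0.005]
positions = [1, -1, 1, 1, -1, 1]

net = net_returns(gross, positions, cost_per_unit_turnover=0.001)  # 10 bps, assumed

gross_sharpe = annualized_sharpe(gross)
net_sharpe = annualized_sharpe(net)
print(f"gross Sharpe: {gross_sharpe:.2f}, net Sharpe: {net_sharpe:.2f}")
```

In this toy series the strategy flips position often, so even a small per-trade cost removes most of the mean return while leaving volatility roughly unchanged. Frequently rebalancing agents are exposed to the same mechanism when their backtests omit costs.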
Abstract
End-to-end LLM trading agents have moved quickly from research curiosity to a small ecosystem of named systems, including FinCon, FinMem, TradingAgents, FinAgent, QuantAgent, and FLAG-Trader. Several of these report headline Sharpe ratios that would be material if read at face value on a deployment desk, and associated benchmarks such as FinBen report trading-task Sharpe statistics in the same range. The gap between architecture research and deployment claim has been crossed too freely on both sides of the academia--industry divide. We take a position on that gap: reported alpha from end-to-end LLM trading agents should not be treated as deployment evidence. Before such returns can support claims of deployable trading capability, they must survive structural validity tests for temporal integrity, real-world frictions, counterfactual robustness, predictive calibration, numerical…
