SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics
Yuchen Cao, Hanlin Zhang, Jacky Wai Keung, Yang Chen, and Linqi Song

TL;DR
SysTradeBench is a benchmark for evaluating trading systems generated by large language models (LLMs), emphasizing iterative development, drift detection, and multi-dimensional performance metrics.
Contribution
Introduces SysTradeBench, an iterative, diagnostics-enabled benchmark for strategy-to-code trading systems, highlighting the role of LLM iteration alongside human oversight.
Findings
Top models achieve over 91.7% validity.
Iteration induces code convergence across strategies.
LLMs excel at rapid prototyping and shallow bug fixes.
Abstract
Large language models (LLMs) are increasingly used as quantitative research copilots to translate natural-language strategy specifications into executable trading code. Yet most existing evaluations either focus on static financial knowledge or summarize performance with a single profitability metric, leaving a gap for benchmarking strategy-to-code trading systems as governed, auditable software. We introduce SysTradeBench (SysTB), an iterative build-test-patch benchmark that evaluates LLM-generated trading systems under drift-aware diagnostics. Given a standardized Base Strategy Doc and frozen semantics, each model must produce (i) a strategy card, (ii) executable code, and (iii) mandatory audit logs. A sandboxed harness runs determinism and anti-leakage checks, detects rule drift across iterations, and returns evidence bundles to support constrained patches. SysTradeBench reports…
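As a rough illustration of two of the checks the abstract names (determinism and rule-drift detection), the sketch below shows one way such checks could be written. It is not the SysTradeBench harness: the function names (`log_digest`, `is_deterministic`, `rule_drift`), the audit-log shape (a list of dicts), the rule representation (a name-to-text mapping), and `toy_strategy` are all assumptions made for this example.

```python
import hashlib
import json


def log_digest(audit_log):
    """Hash an audit log (assumed here to be a list of dicts) in canonical JSON form."""
    payload = json.dumps(audit_log, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_deterministic(strategy, bars, runs=2):
    """Run the strategy several times on identical input and compare log digests."""
    digests = {log_digest(strategy(bars)) for _ in range(runs)}
    return len(digests) == 1


def rule_drift(frozen_rules, candidate_rules):
    """Compare a candidate rule set against the frozen one (both name -> text dicts)."""
    frozen_keys = set(frozen_rules)
    candidate_keys = set(candidate_rules)
    return {
        "added": sorted(candidate_keys - frozen_keys),
        "removed": sorted(frozen_keys - candidate_keys),
        "changed": sorted(
            key
            for key in frozen_keys & candidate_keys
            if frozen_rules[key] != candidate_rules[key]
        ),
    }


def toy_strategy(bars):
    """Hypothetical strategy: signal 'buy' when the close rises, else 'hold'."""
    log = []
    for prev, cur in zip(bars, bars[1:]):
        signal = "buy" if cur["close"] > prev["close"] else "hold"
        log.append({"t": cur["t"], "signal": signal})
    return log


if __name__ == "__main__":
    bars = [{"t": 0, "close": 1.0}, {"t": 1, "close": 2.0}, {"t": 2, "close": 1.5}]
    print(is_deterministic(toy_strategy, bars))
    print(rule_drift(
        {"entry": "close > prev_close", "stop": "2%"},
        {"entry": "close >= prev_close", "stop": "2%", "take_profit": "5%"},
    ))
```

Hashing a canonical serialization of the audit log makes repeated runs cheap to compare, and reporting drift as added, removed, and changed rule names gives a compact piece of evidence that a constrained patch could be checked against.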
