QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies

Alexey Khoroshilov; Alexey Chernysh; Orkhan Ekhtibarov; Nini Kamkia; Dmitry Zmitrovich

arXiv:2604.15151·cs.CL·April 17, 2026

TL;DR

QuantCode-Bench is a new benchmark designed to evaluate large language models' ability to generate executable algorithmic trading strategies from natural language descriptions, emphasizing domain-specific logic and API usage.

Contribution

The paper introduces QuantCode-Bench, a 400-task benchmark for assessing how well LLMs generate trading strategies, and analyzes the limitations of current models in this domain.

Findings

01. Current models struggle with operationalizing trading logic and API usage.

02. Success depends on aligning natural language, financial logic, and data behavior.

03. Most failures are not due to syntax but to semantic and operational errors.

Abstract

Large language models have demonstrated strong performance on general-purpose programming tasks, yet their ability to generate executable algorithmic trading strategies remains underexplored. Unlike standard code benchmarks, trading-strategy generation requires simultaneous mastery of domain-specific financial logic, knowledge of a specialized API, and the ability to produce code that is not only syntactically correct but also leads to actual trades on historical data. In this work, we present QuantCode-Bench, a benchmark for the systematic evaluation of modern LLMs in generating strategies for the Backtrader framework from textual descriptions in English. The benchmark contains 400 tasks of varying difficulty collected from Reddit, TradingView, StackExchange, GitHub, and synthetic sources. Evaluation is conducted through a multi-stage pipeline that checks syntactic correctness,…
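
As an illustration of the only pipeline stage the truncated abstract names, syntactic correctness, the following is a minimal sketch of a static check using nothing but Python's standard `ast` module. It is not the benchmark's actual pipeline. The function names `check_syntax` and `strategy_class_names`, the sample strategy `SmaCross`, and the extra check for a class deriving from `bt.Strategy` are all assumptions made for this example. The candidate code is only parsed, never executed, so Backtrader does not need to be installed.

```python
import ast

# Hypothetical model output: a Backtrader-style strategy given as source text.
# It is only parsed here, not run, so no backtest takes place.
CANDIDATE = '''
import backtrader as bt

class SmaCross(bt.Strategy):
    def __init__(self):
        self.sma = bt.indicators.SimpleMovingAverage(self.data.close, period=20)

    def next(self):
        if not self.position and self.data.close[0] > self.sma[0]:
            self.buy()
        elif self.position and self.data.close[0] < self.sma[0]:
            self.close()
'''


def check_syntax(source: str) -> tuple[bool, str]:
    """Return (ok, message) describing whether the source parses as Python."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"
    return True, "ok"


def strategy_class_names(source: str) -> list[str]:
    """List classes whose base is written as a Backtrader Strategy.

    Assumption: the strategy base class appears literally as `bt.Strategy`,
    `backtrader.Strategy`, or `Strategy` in the class definition.
    """
    accepted = {"bt.Strategy", "backtrader.Strategy", "Strategy"}
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            # ast.unparse turns each base expression back into source text.
            if any(ast.unparse(base) in accepted for base in node.bases):
                names.append(node.name)
    return names
```

A check like this can only reject code that fails to parse or lacks a strategy class. It says nothing about whether the strategy places trades on historical data, which the abstract identifies as a separate requirement.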

Code & Models

Repositories

limexailab/QuantCode-Bench
github

Datasets

umaimakhan01/domain-code-bench
dataset· 64 dl