BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting
Zhensheng Wang, Wenmian Yang, Qingtai Wu, Lequan Ma, Yiquan Zhang, and Weijia Jia

TL;DR
BacktestBench is a large-scale benchmark dataset designed to evaluate and advance the use of Large Language Models in automating quantitative trading strategy backtesting, addressing current scalability and technical barriers.
Contribution
The paper introduces BacktestBench, the first large-scale benchmark for automated quantitative backtesting, and proposes AutoBacktest, a multi-agent system that translates natural-language strategy descriptions into reproducible backtests.
Findings
Evaluation of 23 LLMs reveals key performance factors.
Grounded verification improves backtesting accuracy.
Standardized indicator representations are crucial.
Abstract
Quantitative backtesting is essential for evaluating trading strategies but remains hampered by high technical barriers and limited scalability. Large Language Models (LLMs) offer a path to automating this complex, interdisciplinary workflow through code generation, tool use, and agentic planning, but progress is held back by the lack of a large-scale benchmark dedicated to automated quantitative backtesting. To bridge this gap, we introduce BacktestBench, the first large-scale benchmark for automated quantitative backtesting. Built from over 6 million real market records, it comprises 18,246 meticulously annotated question-answering pairs across four task categories: metrics calculation, ticker selection, strategy selection, and parameter confirmation. We also propose…
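To make the "metrics calculation" task category more concrete, here is a minimal sketch of what a backtest and its summary metrics can look like. It is not taken from the paper or the benchmark: the moving-average rule, the function names (`moving_average`, `backtest_long_only`, `total_return`, `max_drawdown`), and the toy price series are all illustrative assumptions.

```python
from typing import List, Optional


def moving_average(prices: List[float], window: int) -> List[Optional[float]]:
    """Simple moving average; None until `window` prices are available."""
    out: List[Optional[float]] = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - window : i + 1]) / window)
    return out


def backtest_long_only(prices: List[float], window: int) -> List[float]:
    """Hypothetical rule: hold the asset on day t if the previous close was
    above its moving average, otherwise stay in cash.

    Returns the equity curve, starting at 1.0. The signal uses only data up
    to day t-1, so there is no look-ahead.
    """
    sma = moving_average(prices, window)
    equity = [1.0]
    for t in range(1, len(prices)):
        prev_sma = sma[t - 1]
        in_market = prev_sma is not None and prices[t - 1] > prev_sma
        daily_return = prices[t] / prices[t - 1] - 1.0 if in_market else 0.0
        equity.append(equity[-1] * (1.0 + daily_return))
    return equity


def total_return(equity: List[float]) -> float:
    """Fractional gain or loss over the whole equity curve."""
    return equity[-1] / equity[0] - 1.0


def max_drawdown(equity: List[float]) -> float:
    """Largest peak-to-trough decline, as a positive fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, 1.0 - value / peak)
    return worst


# Toy closing prices (made up for illustration, not benchmark data).
prices = [100.0, 102.0, 101.0, 105.0, 110.0, 108.0, 104.0, 107.0]
equity = backtest_long_only(prices, window=3)
print(f"total return: {total_return(equity):.4f}")
print(f"max drawdown: {max_drawdown(equity):.4f}")
```

A question in this category would presumably pair a strategy description with a metric such as these and expect a numeric answer, which is why reproducibility of the computation matters.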
