TL;DR
KellyBench is a new environment for evaluating long-horizon decision-making in sports betting. The frontier language models tested on it lose money on average, which exposes their current limits in dynamic, non-stationary markets.
Contribution
Introduces KellyBench, a benchmark for long-horizon sequential decision-making set in a simulation of the 2023-24 English Premier League season, with detailed historical data (advanced statistics, lineups, and public odds) and evaluation protocols.
Findings
All evaluated models lose money on average over the season.
The best model achieves an average return of -8%.
When scored against a rubric, the models are less sophisticated than human experts.
Abstract
Language models are saturating benchmarks for procedural tasks with narrow objectives. But they are increasingly being deployed in long-horizon, non-stationary environments with open-ended goals. In this paper we introduce KellyBench, an environment for evaluating sequential decision-making in sports betting markets. Agents are placed in a sequential simulation of the 2023-24 English Premier League season and tasked with maximising their long-term bankroll growth. They are given detailed historical data, including advanced statistics, lineups, and public odds. To succeed they must build machine learning models, identify edge in public markets, and adapt as the environment changes over time. We find that all frontier models evaluated lose money on average over the course of the season for five seeds. The best performing model achieves an average return of -8%, and many models…
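To make "identify edge in public markets" and "maximising long-term bankroll growth" concrete, here is a minimal sketch in Python. It assumes the Kelly criterion as the staking rule, which the benchmark's name alludes to but which the abstract does not say agents must use. The function names, the odds, and the model probability are all hypothetical and chosen only for illustration.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds to probabilities, normalising away the bookmaker margin."""
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)  # exceeds 1 when the book carries a margin
    return [p / total for p in raw]


def kelly_fraction(p_win, decimal_odds):
    """Kelly stake as a fraction of bankroll for one binary bet; 0 when there is no edge."""
    b = decimal_odds - 1.0  # net profit per unit staked
    f = (p_win * b - (1.0 - p_win)) / b
    return max(f, 0.0)  # never stake on a negative-expectation bet


# Hypothetical home/draw/away odds and a hypothetical model estimate (not from the paper).
odds = [2.10, 3.40, 3.80]
market = implied_probabilities(odds)
model_p_home = 0.52
stake = kelly_fraction(model_p_home, odds[0])
print(f"market p(home)={market[0]:.3f}, Kelly stake={stake:.3f} of bankroll")
```

The stake is positive only when the model's probability exceeds the break-even probability implied by the offered odds. An agent whose probability estimates are no better than the market's would therefore stake nothing under this rule, and one with overconfident estimates would overbet and erode its bankroll.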
