KellyBench: A Benchmark for Long-Horizon Sequential Decision Making

Thomas Grady, Kip Parker, Iliyan Zarov, Henry Course, Chengxi Taylor, and Ross Taylor

arXiv:2604.27865 · cs.AI · May 1, 2026

TL;DR

KellyBench is a new environment for evaluating long-horizon sequential decision-making in sports betting markets. It shows that current frontier language models struggle to grow a bankroll in a complex, non-stationary market.

Contribution

Introduces KellyBench, a comprehensive benchmark for long-horizon sequential decision-making in sports betting, with detailed data and evaluation protocols.

Findings

01. All evaluated models lose money on average over the season.

02. The best model achieves an average return of -8%.

03. Models are less sophisticated than human experts according to a rubric.

Abstract

Language models are saturating benchmarks for procedural tasks with narrow objectives. But they are increasingly being deployed in long-horizon, non-stationary environments with open-ended goals. In this paper we introduce KellyBench, an environment for evaluating sequential decision-making in sports betting markets. Agents are placed in a sequential simulation of the 2023-24 English Premier League season and tasked with maximising their long-term bankroll growth. They are given detailed historical data, including advanced statistics, lineups, and public odds. To succeed they must build machine learning models, identify edge in public markets, and adapt as the environment changes over time. We find that all frontier models evaluated lose money on average over the course of the season for five seeds. The best performing model achieves an average return of -8%, and many models…
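The abstract frames the task as maximising long-term bankroll growth by finding edge against public odds. The benchmark's name suggests the Kelly criterion as the reference idea, although the excerpt above does not say how agents are expected to size their bets. As a minimal sketch under that assumption, the snippet below converts decimal odds into margin-free implied probabilities and computes the textbook Kelly stake for a single binary bet. The function names `implied_probabilities` and `kelly_fraction` are hypothetical and are not taken from the KellyBench repository.

```python
from typing import List


def implied_probabilities(decimal_odds: List[float]) -> List[float]:
    """Turn decimal odds for mutually exclusive outcomes into probabilities.

    Inverse odds usually sum to more than 1 because of the bookmaker margin,
    so they are rescaled proportionally to sum to exactly 1. This is one
    simple normalisation choice, assumed here for illustration.
    """
    inverse = [1.0 / o for o in decimal_odds]
    total = sum(inverse)
    return [x / total for x in inverse]


def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Textbook Kelly stake as a fraction of bankroll for one binary bet.

    f* = (b * p - q) / b, where b is the net payout per unit staked
    (decimal odds minus 1), p is the bettor's win probability, and q = 1 - p.
    Negative values mean no edge, so the stake is clipped to zero.
    """
    b = decimal_odds - 1.0
    if b <= 0.0:
        return 0.0  # odds of 1.0 or less can never be profitable
    f = (b * p_win - (1.0 - p_win)) / b
    return max(0.0, f)


if __name__ == "__main__":
    # Hypothetical home/draw/away prices, not data from the paper.
    market = implied_probabilities([2.0, 3.5, 4.0])
    print("market-implied probabilities:", market)
    # A model that rates the home win at 0.55 against odds of 2.0 has an edge.
    print("Kelly stake on home win:", kelly_fraction(0.55, 2.0))
```

For instance, a model probability of 0.5 at decimal odds of 2.2 gives a Kelly stake of roughly 8.3% of the bankroll, while a probability of 0.4 at odds of 2.0 gives zero because the bet has negative expected value. In practice bettors often stake only a fraction of the Kelly amount to limit variance when their probability estimates are noisy.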

Code & Models

Repositories

https://openreward.ai/GeneralReasoning/KellyBench