Neural Sabermetrics with World Model: Play-by-play Predictive Modeling with Large Language Model
Young Jin Ahn, Yiyang Du, Zheyuan Zhang, Haisen Kang

TL;DR
This paper introduces a world model for baseball built on a large language model (LLM). It predicts game events and outcomes play by play, outperforms previous models in accuracy, and provides a unified generative framework.
Contribution
The work presents the first LLM-based baseball world model trained on extensive MLB data, capable of multi-aspect game prediction within a single framework.
Findings
Correctly predicts 64% of next pitches within a plate appearance
Correctly predicts 78% of batter swing decisions
Outperforms existing neural baselines in accuracy
Abstract
Classical sabermetrics has profoundly shaped baseball analytics by summarizing long histories of play into compact statistics. While these metrics are invaluable for valuation and retrospective analysis, they do not define a generative model of how baseball games unfold pitch by pitch, leaving most existing approaches limited to single-step prediction or post-hoc analysis. In this work, we present Neural Sabermetrics with World Model, a Large Language Model (LLM) based play-by-play world model for baseball. We cast baseball games as long auto-regressive sequences of events and continuously pretrain a single LLM on more than ten years of Major League Baseball (MLB) tracking data, comprising over seven million pitch sequences and approximately three billion tokens. The resulting model is capable of predicting multiple aspects of game evolution within a unified framework. We evaluate our…
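The idea of casting a game as one long auto-regressive sequence can be illustrated with a small sketch. The abstract does not specify how events are encoded, so everything below is an assumption made for illustration: the `PitchEvent` fields, the angle-bracket text format, and the helper names are hypothetical and are not the authors' tokenization. The sketch only shows the general pattern: serialize each pitch as text, then form (context, target) pairs in which each event is predicted from all earlier events.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PitchEvent:
    # Hypothetical fields; the paper's actual schema is not given in the abstract.
    inning: int
    balls: int
    strikes: int
    pitch_type: str
    swing: bool
    result: str


def serialize_event(e: PitchEvent) -> str:
    """Render one pitch as a single line of text (assumed format)."""
    decision = "swing" if e.swing else "take"
    return (
        f"<inning={e.inning}> <count={e.balls}-{e.strikes}> "
        f"<pitch={e.pitch_type}> <{decision}> <result={e.result}>"
    )


def serialize_game(events: List[PitchEvent]) -> str:
    """Concatenate all pitches into one long sequence, one event per line."""
    return "\n".join(serialize_event(e) for e in events)


def next_event_pairs(events: List[PitchEvent]) -> List[Tuple[str, str]]:
    """Build (context, target) pairs: event i is predicted from events 0..i-1."""
    lines = [serialize_event(e) for e in events]
    return [("\n".join(lines[:i]), lines[i]) for i in range(1, len(lines))]


if __name__ == "__main__":
    plate_appearance = [
        PitchEvent(1, 0, 0, "FF", False, "ball"),
        PitchEvent(1, 1, 0, "SL", True, "foul"),
        PitchEvent(1, 1, 1, "CH", True, "in_play"),
    ]
    for context, target in next_event_pairs(plate_appearance):
        print("CONTEXT:\n" + context)
        print("TARGET: " + target + "\n")
```

In a language-model setting, the text produced by `serialize_game` would be tokenized and trained with a standard next-token objective. The explicit pairs are shown here only to make the auto-regressive structure visible.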
Taxonomy
Topics: Sports Analytics and Performance · Gaussian Processes and Bayesian Inference · Sports Dynamics and Biomechanics
