Structure of Pitch-Pattern Motifs in Major League Baseball
Youngjai Park, Cheawoon Lim, Seung-Woo Son, and Mi Jin Lee

TL;DR
This study analyzes the sequential patterns of pitches in Major League Baseball using large-scale data, revealing a language-like structure that is stable but only weakly related to game outcomes.
Contribution
The paper presents a large-scale analysis of pitch-pattern motifs across roughly 12.4 million pitches, showing that MLB pitch sequences follow stable, language-like statistical laws (Zipf's and Heaps').
Findings
Pitch-sequence diversity correlates weakly with performance metrics.
Pitch motifs follow Zipf's and Heaps' laws, indicating a language-like structure.
Motif usage shows limited predictive power for game outcomes.
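For readers unfamiliar with the two laws named in the findings: Zipf's law says a motif's frequency falls off roughly as a power of its frequency rank, and Heaps' law says the number of distinct motifs grows roughly as a power of the number of motifs observed. The sketch below shows how the two underlying curves could be computed from a stream of motifs. The motif labels and function names are hypothetical and are not taken from the paper; this is an illustration, not the authors' pipeline.

```python
from collections import Counter

def rank_frequency(motif_stream):
    """Motif counts sorted from most to least frequent (the Zipf curve)."""
    counts = Counter(motif_stream)
    return sorted(counts.values(), reverse=True)

def vocabulary_growth(motif_stream):
    """Number of distinct motifs seen after each observation (the Heaps curve)."""
    seen = set()
    growth = []
    for motif in motif_stream:
        seen.add(motif)
        growth.append(len(seen))
    return growth

# Hypothetical toy stream of length-2 motifs, written as "first-second" pitch codes.
stream = ["FF-SL", "SL-FF", "FF-SL", "FF-CH", "FF-SL", "CU-FF"]

zipf_curve = rank_frequency(stream)
heaps_curve = vocabulary_growth(stream)
```

On real data one would plot both curves on log-log axes and check whether they are approximately straight lines; a toy stream this short cannot show that.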
Abstract
Baseball consists of two teams alternating between batting and fielding while competing to score runs through sequential pitching events. Recent advances in tracking technology have enabled all Major League Baseball (MLB) clubs to record every pitch with high resolution, yet most quantitative studies have primarily emphasized single-pitch metrics, leaving the role of sequential structure less explored. Here, we examine pitch-pattern motifs of multiple lengths using approximately 12.4 million Statcast pitch recordings from the 2008-2025 MLB regular seasons at two complementary scales. At the macroscale, we quantify pitch-sequence diversity using the Shannon entropy and inverse Simpson index and examine their relationships with earned run average and wins. At the microscale, we compare hit and out frequencies across pitch-pattern motifs. Rather than identifying outcome-determining…
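The two diversity measures named in the abstract have standard definitions: for motif proportions p_i, the Shannon entropy is H = -Σ p_i ln p_i and the inverse Simpson index is 1 / Σ p_i². A minimal sketch of computing both from motif counts follows. It assumes that motifs are contiguous windows of pitch types within a single sequence, which may differ from the paper's exact definition, and the pitch codes and sample sequences are made up for illustration.

```python
from collections import Counter
from math import log

def motifs(sequence, n):
    """All contiguous length-n windows of a pitch-type sequence (assumed motif definition)."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

def shannon_entropy(counts):
    """H = -sum(p * ln p) over motif proportions p, in nats."""
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())

def inverse_simpson(counts):
    """1 / sum(p^2): the effective number of equally used motifs."""
    total = sum(counts.values())
    return 1.0 / sum((c / total) ** 2 for c in counts.values())

# Hypothetical pitch sequences; the codes (FF, SL, CH, CU) are illustrative only.
sequences = [
    ["FF", "SL", "FF", "CH"],
    ["FF", "SL", "FF"],
    ["CU", "FF", "SL"],
]

# Count length-2 motifs without letting windows span two sequences.
counts = Counter(m for seq in sequences for m in motifs(seq, 2))

entropy = shannon_entropy(counts)
simpson = inverse_simpson(counts)
```

Both indices increase as motif usage becomes more even; when k motifs are used equally often, the entropy equals ln k and the inverse Simpson index equals k.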
