Evaluating Model-Free Policy Optimization in Masked-Action Environments via an Exact Blackjack Oracle
Kevin Song

TL;DR
This paper evaluates model-free policy optimization methods in a rigorously defined blackjack environment with dynamic action masking, comparing their sample efficiency and policy accuracy against an exact dynamic programming oracle.
Contribution
The paper introduces a benchmark for discrete stochastic control with dynamic action masking and compares three model-free optimizers (masked REINFORCE, SPSA, and CEM) against an exact dynamic programming oracle.
Findings
REINFORCE was the most sample-efficient of the three optimizers, reaching a 46.37% action-match rate and an EV of -0.04688 after 10^6 hands.
All methods showed significant cell-conditional regret.
Without card counting, optimal bet sizing collapses to the minimum bet, increasing volatility.
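The two evaluation quantities named on this page, action-match rate and cell-conditional regret, can be illustrated with a minimal sketch. It assumes regret in a cell is the oracle value of the best legal action minus the oracle value of the action the learned policy picks, and that cells are averaged with equal weight; the paper may weight cells differently. The function name `cell_metrics` and the toy values are hypothetical, not taken from the paper.

```python
def cell_metrics(q_oracle, legal, policy_action):
    """Compare a learned policy with oracle action values, cell by cell.

    q_oracle:      dict cell -> list of oracle action values
    legal:         dict cell -> list of bools (the action mask for that cell)
    policy_action: dict cell -> index of the action the learned policy picks
    Returns (action_match_rate, mean_regret), both unweighted over cells
    (an assumption; a visitation-weighted average is equally plausible).
    """
    matches = 0
    total_regret = 0.0
    for cell, q in q_oracle.items():
        legal_ids = [a for a, ok in enumerate(legal[cell]) if ok]
        best = max(legal_ids, key=lambda a: q[a])  # oracle-optimal legal action
        chosen = policy_action[cell]
        if chosen not in legal_ids:
            raise ValueError(f"policy chose a masked action in cell {cell!r}")
        matches += int(chosen == best)
        total_regret += q[best] - q[chosen]  # zero when the policy is optimal
    n = len(q_oracle)
    return matches / n, total_regret / n


# Hypothetical toy data: two cells, three actions, one masked action in cell "A".
q_oracle = {"A": [-0.5, 0.1, 0.3], "B": [-0.2, -0.4, 0.0]}
legal = {"A": [True, True, False], "B": [True, True, True]}
policy_action = {"A": 1, "B": 0}

match_rate, mean_regret = cell_metrics(q_oracle, legal, policy_action)
```

Note that the masked action in cell "A" has the highest raw value but is excluded from the comparison, which is the point of evaluating under the mask.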
Abstract
Infinite-shoe casino blackjack provides a rigorous, exactly verifiable benchmark for discrete stochastic control under dynamically masked actions. Under a fixed Vegas-style ruleset (S17, 3:2 payout, dealer peek, double on any two, double after split, resplit to four), an exact dynamic programming (DP) oracle was derived over 4,600 canonical decision cells. This oracle yielded ground-truth action values, optimal policy labels, and a theoretical expected value (EV) of -0.00161 per hand. To evaluate sample-efficient policy recovery, three model-free optimizers were trained via simulated interaction: masked REINFORCE with a per-cell exponential moving average baseline, simultaneous perturbation stochastic approximation (SPSA), and the cross-entropy method (CEM). REINFORCE was the most sample-efficient, achieving a 46.37% action-match rate and an EV of -0.04688 after 10^6 hands,…
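As a rough illustration of what "masked REINFORCE with a per-cell exponential moving average baseline" could look like, the sketch below keeps one logit vector and one scalar baseline per decision cell, samples only from legal actions, and updates the logits with the score-function gradient. The learning rate `lr`, the EMA rate `beta`, the tabular parameterization, and the toy reward function are all assumptions made for illustration; none of them are taken from the paper.

```python
import math
import random


def masked_softmax(logits, mask):
    """Softmax restricted to legal actions; masked actions get probability 0."""
    top = max(l for l, ok in zip(logits, mask) if ok)  # for numerical stability
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, mask)]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_step(logits, baseline, cell, mask, reward_fn, rng, lr=0.1, beta=0.05):
    """One masked REINFORCE update for a single visit to `cell`."""
    probs = masked_softmax(logits[cell], mask)
    legal_ids = [k for k, ok in enumerate(mask) if ok]
    action = rng.choices(legal_ids, weights=[probs[k] for k in legal_ids])[0]
    reward = reward_fn(action)
    advantage = reward - baseline[cell]  # baseline is read before it is updated
    for k in legal_ids:
        # d/d logit_k of log pi(action) = 1[k == action] - pi(k) for legal k
        grad = (1.0 if k == action else 0.0) - probs[k]
        logits[cell][k] += lr * advantage * grad
    # Per-cell exponential moving average of observed rewards
    baseline[cell] += beta * (reward - baseline[cell])
    return action


# Hypothetical toy cell: actions 0 and 1 are legal, action 2 is masked.
rng = random.Random(0)
cell = "hard-16-vs-10"
mask = [True, True, False]
logits = {cell: [0.0, 0.0, 0.0]}
baseline = {cell: 0.0}
toy_reward = lambda a: 1.0 if a == 1 else -1.0  # stand-in for a simulated hand

sampled = [reinforce_step(logits, baseline, cell, mask, toy_reward, rng)
           for _ in range(500)]
final_probs = masked_softmax(logits[cell], mask)
```

In this toy setting the masked action is never sampled and its logit is never touched, so the policy's probability mass stays on the legal set while shifting toward the better-paying action.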
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Sports Analytics and Performance
