Evaluating Model-Free Policy Optimization in Masked-Action Environments via an Exact Blackjack Oracle

Kevin Song

arXiv:2603.18642 · cs.LG · March 20, 2026

TL;DR

This paper evaluates model-free policy optimization methods in a rigorously defined blackjack environment with dynamic action masking, comparing their sample efficiency and policy accuracy against an exact dynamic programming oracle.

Contribution

The paper introduces a benchmark for discrete stochastic control under dynamically masked actions and compares three model-free optimizers (masked REINFORCE, SPSA, and CEM) against an exact dynamic programming oracle.

Findings

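1. REINFORCE was the most sample-efficient of the three optimizers, reaching a 46.37% action-match rate and an EV of -0.04688 after 10^6 hands.

2. All three methods retained significant cell-conditional regret.

3. Without card counting, optimal bet sizing collapses to the minimum bet, increasing volatility.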
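
The action-match rate and cell-conditional regret mentioned in the findings can be illustrated with a minimal sketch. It assumes regret in a cell is the gap between the oracle's best legal action value and the oracle value of the action the learned policy picks; the paper's exact definition may differ. The function name, the array layout, and the toy numbers in the usage note are hypothetical and are not taken from the paper.

```python
import numpy as np

def action_match_and_regret(q_oracle, mask, policy_actions):
    """Compare a learned policy against oracle action values.

    q_oracle:       (cells, actions) oracle action values (assumed layout).
    mask:           (cells, actions) boolean, True where the action is legal.
    policy_actions: (cells,) index of the action the learned policy chooses.
    """
    # Illegal actions can never be optimal, so push them to -inf.
    q_legal = np.where(mask, q_oracle, -np.inf)
    best_action = q_legal.argmax(axis=1)   # ties resolve to the lowest index
    best_value = q_legal.max(axis=1)
    rows = np.arange(len(policy_actions))
    chosen_value = q_legal[rows, policy_actions]
    # Regret is zero when the policy matches the oracle and positive otherwise;
    # it is +inf if the policy picks an illegal action.
    regret = best_value - chosen_value
    match_rate = float(np.mean(policy_actions == best_action))
    return match_rate, regret
```

For example, with two cells, `q_oracle = [[0.1, -0.2, 0.5], [-1.0, -0.5, 0.3]]`, the third action masked out in the first cell, and a policy that picks actions `[0, 1]`, the policy matches the oracle in the first cell only, so the match rate is 0.5 and the per-cell regrets are 0 and 0.8.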

Abstract

Infinite-shoe casino blackjack provides a rigorous, exactly verifiable benchmark for discrete stochastic control under dynamically masked actions. Under a fixed Vegas-style ruleset (S17, 3:2 payout, dealer peek, double on any two, double after split, resplit to four), an exact dynamic programming (DP) oracle was derived over 4,600 canonical decision cells. This oracle yielded ground-truth action values, optimal policy labels, and a theoretical expected value (EV) of -0.00161 per hand. To evaluate sample-efficient policy recovery, three model-free optimizers were trained via simulated interaction: masked REINFORCE with a per-cell exponential moving average baseline, simultaneous perturbation stochastic approximation (SPSA), and the cross-entropy method (CEM). REINFORCE was the most sample-efficient, achieving a 46.37% action-match rate and an EV of -0.04688 after 10^6 hands,…
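
As a rough illustration of masked REINFORCE with a per-cell exponential moving average baseline, the sketch below assumes a tabular softmax policy with one logit per (cell, action) pair and a single reward per hand. The function names, the learning rate `lr`, and the EMA rate `beta` are hypothetical choices for illustration and do not come from the paper.

```python
import numpy as np

def masked_softmax(logits, mask):
    """Softmax over legal actions only; illegal actions get probability 0."""
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max()   # stabilise; needs at least one legal action
    exp = np.exp(masked)             # exp(-inf) evaluates to 0.0
    return exp / exp.sum()

def reinforce_step(theta, baseline, cell, mask, action, reward,
                   lr=0.1, beta=0.05):
    """One in-place REINFORCE update for a single visited decision cell.

    theta:    (cells, actions) policy logits.
    baseline: (cells,) running average reward per cell.
    """
    probs = masked_softmax(theta[cell], mask)
    advantage = reward - baseline[cell]
    # Gradient of log pi(action) w.r.t. the logits is one_hot(action) - probs.
    # Masked actions have probs == 0, so their logits are never updated.
    grad_log = -probs
    grad_log[action] += 1.0
    theta[cell] += lr * advantage * grad_log
    # Exponential moving average baseline, tracked separately for each cell.
    baseline[cell] += beta * (reward - baseline[cell])
    return probs
```

Masking the logits before the softmax, rather than renormalising sampled actions afterwards, keeps the log-probability gradient exactly zero on illegal actions, which is one common way to handle dynamic action masks in policy-gradient methods.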

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Sports Analytics and Performance