Spark Policy Toolkit: Semantic Contracts and Scalable Execution for Policy Learning in Spark
Zeyu Bai

TL;DR
The paper introduces Spark Policy Toolkit, which enables scalable, semantics-preserving policy learning in Spark through two native primitives: vectorized inference that replaces rowwise Python execution, and split search that scores candidates on executors instead of materializing them on the driver.
Contribution
It presents two Spark-native primitives governed by one fixed-input semantic contract: partition-initialized vectorized inference through mapInPandas and mapInArrow, and collect-less split search that scores candidates on executors.
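The partition-initialized pattern can be sketched as follows. This is a minimal illustration, not the toolkit's code: the column names, the `load_model` stand-in, and the linear scorer are all hypothetical. The function has the shape that `DataFrame.mapInPandas` expects (an iterator of pandas DataFrames in, an iterator out), so expensive setup placed before the loop runs once per partition rather than once per row.

```python
from typing import Iterator

import numpy as np
import pandas as pd

FEATURES = ["f0", "f1"]            # hypothetical fixed feature order
TREATMENTS = ["control", "treat"]  # hypothetical treatment vocabulary


def load_model() -> np.ndarray:
    """Stand-in for an expensive model load (assumption for illustration).

    Returns a (features x treatments) weight matrix.
    """
    return np.array([[1.0, 0.5], [0.0, 2.0]])


def score_partition(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Score every batch of one partition with a single model instance."""
    weights = load_model()  # runs once per partition, before any batch is read
    for batch in batches:
        # One vectorized matrix product per batch instead of a Python call per row.
        scores = batch[FEATURES].to_numpy() @ weights
        out = pd.DataFrame(
            scores,
            columns=[f"score_{t}" for t in TREATMENTS],
            index=batch.index,
        )
        out.insert(0, "row_id", batch["row_id"])
        yield out


# With a live SparkSession and a DataFrame `df` holding row_id, f0, f1,
# the same function would be passed to mapInPandas:
#   scored = df.mapInPandas(
#       score_partition,
#       schema="row_id long, score_control double, score_treat double",
#   )
```

Because the function only depends on pandas and NumPy, it can be exercised locally on a list of DataFrames before it is attached to a Spark job.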
Findings
mapInArrow achieves 4.72M rows/sec at 10M matched rows
collect-less split search remains valid from F=10 to F=1000 with 124,000 candidates
the best-performing backend depends on the workload, so neither primitive has a single fastest configuration
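The idea behind collect-less split search can be illustrated in plain Python. The sketch below is an assumption-laden stand-in, not the paper's algorithm: the candidate list, the row layout, and the scoring criterion (gap between left and right mean outcome) are hypothetical. What it shows is the data flow: each partition is reduced to small per-candidate statistics, those partials are merged, and only the merged totals are scored, so raw rows never need to be gathered in one place.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical candidates: (feature index, threshold) pairs fixed before the search.
CANDIDATES = [(0, 0.5), (0, 1.5), (1, 10.0)]


def partition_stats(rows):
    """Executor-side step: fold one partition into per-candidate statistics.

    Each row is (features, outcome). The result maps a candidate to
    [left_count, left_sum, right_count, right_sum].
    """
    stats = defaultdict(lambda: [0, 0.0, 0, 0.0])
    for features, outcome in rows:
        for cand in CANDIDATES:
            feature, threshold = cand
            offset = 0 if features[feature] <= threshold else 2
            stats[cand][offset] += 1
            stats[cand][offset + 1] += outcome
    return dict(stats)


def merge_stats(a, b):
    """Combine two partial results by elementwise addition."""
    merged = {cand: list(vals) for cand, vals in a.items()}
    for cand, vals in b.items():
        if cand in merged:
            merged[cand] = [x + y for x, y in zip(merged[cand], vals)]
        else:
            merged[cand] = list(vals)
    return merged


def score(vals):
    """Illustrative criterion: absolute gap between left and right mean outcome."""
    left_n, left_sum, right_n, right_sum = vals
    if left_n == 0 or right_n == 0:
        return float("-inf")  # degenerate split with an empty side
    return abs(left_sum / left_n - right_sum / right_n)


def best_split(partitions):
    """Merge per-partition statistics, then pick the highest-scoring candidate."""
    totals = reduce(merge_stats, (partition_stats(p) for p in partitions))
    # Iterating in sorted order makes tie-breaking deterministic.
    return max(sorted(totals), key=lambda cand: score(totals[cand]))
```

In a Spark job, `partition_stats` and `merge_stats` would roughly correspond to a per-partition map followed by a distributed reduce. The merged result grows with the number of candidates, not with the number of rows.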
Abstract
Custom policy-learning pipelines in Spark fail for two coupled systems reasons: rowwise Python execution makes inference impractical, and driver-side candidate materialization makes split search fragile at feature scale. We present Spark Policy Toolkit, a semantics-governed systems toolkit for scalable policy learning in Spark. The toolkit provides two Spark-native primitives: partition-initialized vectorized inference through mapInPandas and mapInArrow, and collect-less split search that scores candidates on executors. Both primitives are governed by one fixed-input semantic contract: the same rows, feature order, treatment vocabulary, preprocessing manifest, and split boundaries must preserve per-row score vectors, best-split decisions, and end-to-end learned policy outputs. The evaluation combines practical baseline ladders, backend parity checks, measured split-search scale results,…
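One way to make a fixed-input contract of this kind checkable is to fingerprint the declared inputs and compare per-row score vectors across backends. The sketch below is an assumption, not the toolkit's API: the function names, the JSON encoding, and the absolute tolerance are hypothetical, and the paper may require exact equality instead.

```python
import hashlib
import json
import math


def contract_fingerprint(feature_order, treatment_vocabulary,
                         preprocessing_manifest, split_boundaries):
    """Hash the fixed inputs so two runs can confirm they share one contract.

    Lists keep their order in the encoding, so reordering features or
    treatments changes the fingerprint; dict keys are sorted for stability.
    """
    payload = json.dumps(
        {
            "feature_order": list(feature_order),
            "treatment_vocabulary": list(treatment_vocabulary),
            "preprocessing_manifest": preprocessing_manifest,
            "split_boundaries": split_boundaries,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def scores_match(reference, candidate, abs_tol=1e-9):
    """Compare per-row score vectors keyed by row id.

    Returns False if the row sets differ, if any vector length differs,
    or if any score differs by more than abs_tol (a hypothetical tolerance).
    """
    if reference.keys() != candidate.keys():
        return False
    for row_id, ref_vec in reference.items():
        cand_vec = candidate[row_id]
        if len(ref_vec) != len(cand_vec):
            return False
        if not all(
            math.isclose(r, c, rel_tol=0.0, abs_tol=abs_tol)
            for r, c in zip(ref_vec, cand_vec)
        ):
            return False
    return True
```

A parity check between two backends would then compare fingerprints first and score vectors second, so that a mismatch in outputs is never attributed to a backend when the inputs themselves differed.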
