Spark Policy Toolkit: Semantic Contracts and Scalable Execution for Policy Learning in Spark

Zeyu Bai

arXiv:2604.25061 · cs.DC · April 29, 2026

TL;DR

The paper introduces Spark Policy Toolkit, enabling scalable, semantics-preserving policy learning in Spark through native primitives that improve inference speed and maintain policy integrity at large scale.

Contribution

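It presents two Spark-native primitives: partition-initialized vectorized inference through mapInPandas and mapInArrow, and collect-less split search that scores candidates on executors. Both are governed by a single fixed-input semantic contract, so scaling up does not change the learned policy.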
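
To make "partition-initialized vectorized inference" concrete, here is a minimal sketch in plain Python that simulates one partition without a Spark cluster. It is not the toolkit's code. The names `load_model`, `score_partition`, and `LOAD_COUNT`, the linear scorer, and the dict-of-columns batch type are all assumptions made for illustration; in Spark the batches would be pandas DataFrames (mapInPandas) or pyarrow RecordBatches (mapInArrow).

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical counter that records how often the (expensive) model load runs.
LOAD_COUNT = 0

Batch = Dict[str, List[float]]  # column name -> column values (illustrative batch type)


def load_model() -> List[List[float]]:
    """Hypothetical model load: one weight vector per treatment."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return [[0.5, -1.0], [1.5, 0.25]]


def score_partition(batches: Iterable[Batch]) -> Iterator[Batch]:
    """Iterator of batches in, iterator of batches out, as mapInPandas expects."""
    weights = load_model()  # runs once per partition, not once per row
    for batch in batches:
        x0, x1 = batch["x0"], batch["x1"]
        out: Batch = {"row_id": batch["row_id"]}
        for t, (w0, w1) in enumerate(weights):
            # Whole-column computation; pandas or Arrow would vectorize this step.
            out[f"score_t{t}"] = [w0 * a + w1 * b for a, b in zip(x0, x1)]
        yield out


# One simulated partition made of two batches.
partition = [
    {"row_id": [0, 1], "x0": [1.0, 2.0], "x1": [0.0, 1.0]},
    {"row_id": [2], "x0": [0.0], "x1": [4.0]},
]
scored = list(score_partition(partition))
```

In PySpark, a function of this shape that consumes and yields pandas DataFrames would be passed as `df.mapInPandas(score_partition, schema=...)`. The point of the pattern is that setup cost is paid per partition while scoring works on whole columns rather than single rows.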

Findings

01  mapInArrow achieves 4.72M rows/sec at 10M matched rows.

02  Collect-less split search remains valid from F=10 to F=1000 with 124,000 candidates.

03  Backend choice depends on the workload and affects primitive performance.

Abstract

Custom policy-learning pipelines in Spark fail for two coupled systems reasons: rowwise Python execution makes inference impractical, and driver-side candidate materialization makes split search fragile at feature scale. We present Spark Policy Toolkit, a semantics-governed systems toolkit for scalable policy learning in Spark. The toolkit provides two Spark-native primitives: partition-initialized vectorized inference through mapInPandas and mapInArrow, and collect-less split search that scores candidates on executors. Both primitives are governed by one fixed-input semantic contract: the same rows, feature order, treatment vocabulary, preprocessing manifest, and split boundaries must preserve per-row score vectors, best-split decisions, and end-to-end learned policy outputs. The evaluation combines practical baseline ladders, backend parity checks, measured split-search scale results,…
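
The collect-less idea in the abstract can be sketched as follows, again in plain Python with lists standing in for Spark partitions. This is an illustration under stated assumptions, not the paper's algorithm: the sufficient statistics, the squared-sum score, and the names `partition_stats`, `merge`, `score`, and `best_split` are all hypothetical. What it shows is that only a small per-candidate statistics table is merged, never the rows themselves, and that the chosen split does not depend on how rows are partitioned.

```python
from functools import reduce
from typing import Dict, List, Tuple

Row = Tuple[List[float], float]       # (feature vector, outcome)
Candidate = Tuple[int, float]         # (feature index, threshold)
Stats = Dict[Candidate, List[float]]  # [n_left, sum_left, n_right, sum_right]


def partition_stats(rows: List[Row], candidates: List[Candidate]) -> Stats:
    """Executor-side step: fold one partition's rows into per-candidate statistics."""
    stats: Stats = {c: [0.0, 0.0, 0.0, 0.0] for c in candidates}
    for features, y in rows:
        for feature, threshold in candidates:
            s = stats[(feature, threshold)]
            if features[feature] <= threshold:
                s[0] += 1
                s[1] += y
            else:
                s[2] += 1
                s[3] += y
    return stats


def merge(a: Stats, b: Stats) -> Stats:
    """Associative merge of two statistics tables (element-wise sum)."""
    return {c: [x + y for x, y in zip(a[c], b[c])] for c in a}


def score(s: List[float]) -> float:
    """Assumed criterion: sum_l^2 / n_l + sum_r^2 / n_r (higher is better)."""
    n_l, sum_l, n_r, sum_r = s
    if n_l == 0 or n_r == 0:
        return float("-inf")  # a split with an empty side is never chosen
    return sum_l ** 2 / n_l + sum_r ** 2 / n_r


def best_split(partitions: List[List[Row]], candidates: List[Candidate]) -> Candidate:
    merged = reduce(merge, (partition_stats(p, candidates) for p in partitions))
    # max() keeps the first maximum, so ties break by the fixed candidate order.
    return max(candidates, key=lambda c: score(merged[c]))


p1 = [([1.0, 5.0], 0.0), ([2.0, 1.0], 0.0)]
p2 = [([3.0, 4.0], 1.0), ([4.0, 2.0], 1.0)]
candidates = [(0, 1.5), (0, 2.5), (1, 3.0)]

two_partitions = best_split([p1, p2], candidates)
one_partition = best_split([p1 + p2], candidates)
```

Here both calls select the same candidate, which mirrors the contract's requirement that the same rows and split boundaries preserve the best-split decision. In Spark, the merge step would typically be expressed as an aggregation so that only the statistics table travels over the network.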
