Shift schema drift left: policy-aware compile-time contracts for typed JVM and Spark pipelines
Vittal Mirji

TL;DR
This paper introduces a small Scala 3 framework that enforces schema compatibility policies for Spark data pipelines at compile time and again at the sink boundary at runtime, so schema drift is caught before a job writes data.
Contribution
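It presents a contract system that pairs a compile-time compatibility witness with a policy-aware runtime comparator for Spark pipelines. This covers the gap between typed-Dataset layers, which require wholesale adoption, and table-level enforcement systems, which operate only at write time against a stored schema.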
Findings
Proves producer-to-contract structural compatibility under explicit policies at compile time.
Derives Spark schemas directly from contract types.
Re-checks DataFrame schemas at sink boundary before writing.
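As an illustration of the second finding, the sketch below derives a field list from a case class at compile time using Scala 3 `Mirror`s. It is not the paper's implementation: `FieldSpec`, `TypeSpec`, `deriveSchema`, and the `Order` contract are hypothetical names, and `FieldSpec` is a plain stand-in for a Spark `StructField` so the example runs without a Spark dependency. It assumes that `Option` fields map to nullable columns.

```scala
import scala.compiletime.{constValueTuple, erasedValue, summonInline}
import scala.deriving.Mirror

// Hypothetical stand-in for a Spark StructField: name, type name, nullability.
final case class FieldSpec(name: String, typeName: String, nullable: Boolean)

// Type class describing how one Scala type maps to a column type (assumed mapping).
trait TypeSpec[A]:
  def typeName: String
  def nullable: Boolean

object TypeSpec:
  given TypeSpec[Int] with
    def typeName = "int"
    def nullable = false
  given TypeSpec[String] with
    def typeName = "string"
    def nullable = false
  // Assumption for this sketch: Option[A] becomes a nullable column of A's type.
  given [A](using inner: TypeSpec[A]): TypeSpec[Option[A]] with
    def typeName = inner.typeName
    def nullable = true

// Walks the tuple of field types at compile time; a field type without a
// TypeSpec instance is a compile error rather than a runtime failure.
inline def specs[T <: Tuple]: List[(String, Boolean)] =
  inline erasedValue[T] match
    case _: EmptyTuple => Nil
    case _: (h *: t) =>
      val s = summonInline[TypeSpec[h]]
      (s.typeName, s.nullable) :: specs[t]

// Pairs the compile-time field labels with their derived type information.
inline def deriveSchema[A](using m: Mirror.ProductOf[A]): List[FieldSpec] =
  val names = constValueTuple[m.MirroredElemLabels].toList.map(_.toString)
  names.zip(specs[m.MirroredElemTypes]).map { case (n, (t, nl)) => FieldSpec(n, t, nl) }

// Hypothetical contract type used only for this example.
final case class Order(id: Int, note: Option[String])

@main def deriveDemo(): Unit =
  deriveSchema[Order].foreach(println)
```

Because the schema is computed from the contract type itself, the type and the schema cannot drift apart, which is the property the finding describes.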
Abstract
Schema drift in data pipelines is often caught only when a job touches real data. Typed-Dataset layers close part of this gap but require wholesale adoption; table-level enforcement systems close another part but operate at write time against a stored schema. We present a small Scala 3 framework that occupies the seam: it proves producer-to-contract structural compatibility under explicit policies at compile time, derives Spark schemas from the same contract types, and re-checks the actual DataFrame schema at the sink boundary before write. The artifact fuses the compile-time witness with a policy-aware runtime comparator that adds a nested-collection-optionality check Spark's built-in comparators omit and implements structural subset semantics for backward- and forward-compatible field sets. Evaluation covers compile-time proofs, runtime policy tests, builder-path end-to-end tests, and…
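A minimal sketch of what a policy-aware runtime comparator with a nested-collection-optionality check could look like follows. It is not the paper's code. The types `DType`, `ArrayT`, `StructT`, and `Field` are a hypothetical mini model that only mirrors the shape of Spark's `StructType` and `ArrayType` (including `containsNull`), and the meanings given to the `Exact`, `Backward`, and `Forward` policies are assumptions made for this example.

```scala
// Hypothetical mini schema model; it only mirrors the shape of Spark's types.
sealed trait DType
case object IntT extends DType
case object StringT extends DType
final case class ArrayT(element: DType, containsNull: Boolean) extends DType
final case class StructT(fields: List[Field]) extends DType
final case class Field(name: String, dtype: DType, nullable: Boolean)

// Assumed policy meanings for this sketch (not taken from the paper):
//   Exact    - field names must match one to one
//   Backward - the actual schema may carry extra fields
//   Forward  - the actual schema may omit nullable contract fields
enum Policy:
  case Exact, Backward, Forward

object SchemaCheck:
  /** Returns the list of violations; an empty list means the schema conforms. */
  def violations(actual: DType, contract: DType, policy: Policy, path: String = "$"): List[String] =
    (actual, contract) match
      case (StructT(af), StructT(cf)) =>
        val actualByName = af.map(f => f.name -> f).toMap
        val contractNames = cf.map(_.name).toSet
        val perContractField = cf.flatMap { c =>
          actualByName.get(c.name) match
            case Some(a) =>
              val nullability =
                if a.nullable && !c.nullable then
                  List(s"$path.${c.name}: nullable but contract requires non-null")
                else Nil
              nullability ++ violations(a.dtype, c.dtype, policy, s"$path.${c.name}")
            case None =>
              if policy == Policy.Forward && c.nullable then Nil
              else List(s"$path.${c.name}: missing")
        }
        val extra =
          if policy == Policy.Backward then Nil
          else af.filterNot(f => contractNames(f.name)).map(f => s"$path.${f.name}: unexpected field")
        perContractField ++ extra
      case (ArrayT(ae, aNull), ArrayT(ce, cNull)) =>
        // Nested-collection optionality: element nullability is compared as well.
        val optionality =
          if aNull && !cNull then List(s"$path[]: elements may be null but contract forbids it")
          else Nil
        optionality ++ violations(ae, ce, policy, s"$path[]")
      case (a, c) if a == c => Nil
      case (a, c) => List(s"$path: expected $c, found $a")

// Sample schemas: the actual one adds a field and loosens array element nullability.
val sampleContract = StructT(List(
  Field("id", IntT, nullable = false),
  Field("tags", ArrayT(StringT, containsNull = false), nullable = false)
))
val sampleActual = StructT(List(
  Field("id", IntT, nullable = false),
  Field("tags", ArrayT(StringT, containsNull = true), nullable = false),
  Field("ingested_by", StringT, nullable = true)
))

@main def comparatorDemo(): Unit =
  SchemaCheck.violations(sampleActual, sampleContract, Policy.Backward).foreach(println)
```

Under the assumed `Backward` policy the extra `ingested_by` field is tolerated and only the array element nullability is reported. Under `Exact` the extra field is reported as well. Returning a list of violations rather than a Boolean lets a sink-boundary check report every mismatch before refusing the write.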
