AI-Driven Generation of Data Contracts in Modern Data Engineering Systems
Harshraj Bhoite

TL;DR
This paper introduces an AI-driven system that uses large language models to generate data contracts automatically, cutting manual workload by over 70% and improving contract accuracy in complex data pipelines.
Contribution
It presents a novel framework that fine-tunes LLMs for structured data domains and integrates the fine-tuned models into data platforms for scalable contract automation.
Findings
High accuracy in contract generation
Over 70% reduction in manual workload
Effective integration with data platforms
Abstract
Data contracts formalize agreements between data producers and consumers regarding schema, semantics, and quality expectations. As data pipelines grow in complexity, manual authoring and maintenance of contracts become error-prone and labor-intensive. We present an AI-driven framework for automatic data contract generation using large language models (LLMs). Our system leverages parameter-efficient fine-tuning (PEFT) methods such as LoRA to adapt LLMs to structured data domains. The models take sample data or schema descriptions as input and output validated contract definitions in formats such as JSON Schema and Avro. We integrate this framework into modern data platforms (e.g., Databricks, Snowflake) to automate contract enforcement at scale. Experimental results on synthetic and real-world datasets demonstrate that the fine-tuned LLMs achieve high accuracy in generating valid contracts…
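To make the input and output of such a system concrete, the sketch below goes from sample records to a JSON Schema style contract and then checks a new record against it. It is not the paper's method: a simple rule-based type inference stands in for the fine-tuned LLM, and the helper names (`json_type`, `infer_contract`, `violations`) and the sample order records are hypothetical, chosen only for illustration.

```python
import json
from typing import Any


def json_type(value: Any) -> str:
    """Map a Python value to a JSON Schema type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # bool is a subclass of int, so test it first
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def infer_contract(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Infer a flat JSON Schema object from sample records (rule-based stand-in)."""
    seen_types: dict[str, set[str]] = {}
    for record in records:
        for key, value in record.items():
            seen_types.setdefault(key, set()).add(json_type(value))
    properties = {}
    for key, types in seen_types.items():
        names = sorted(types)
        properties[key] = {"type": names[0] if len(names) == 1 else names}
    # A field is required only if every sample record carries it.
    required = sorted(k for k in seen_types if all(k in r for r in records))
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "type": "object",
        "properties": properties,
        "required": required,
    }


def violations(record: dict[str, Any], contract: dict[str, Any]) -> list[str]:
    """List the ways a record breaks the contract (types and required fields only)."""
    problems = [
        f"missing required field: {key}"
        for key in contract["required"]
        if key not in record
    ]
    for key, value in record.items():
        spec = contract["properties"].get(key)
        if spec is None:
            continue  # unknown fields are tolerated in this sketch
        allowed = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
        actual = json_type(value)
        # JSON Schema accepts an integer wherever a number is allowed.
        if actual not in allowed and not (actual == "integer" and "number" in allowed):
            problems.append(f"{key}: expected {allowed}, got {actual}")
    return problems


# Hypothetical sample data from a producer.
samples = [
    {"order_id": 1, "amount": 9.5, "coupon": None},
    {"order_id": 2, "amount": 12.0},
]
contract = infer_contract(samples)
print(json.dumps(contract, indent=2))
print(violations({"order_id": "A-17", "amount": 3}, contract))
```

In a pipeline along the lines the abstract describes, the model's generated contract would take the place of `infer_contract`, and a check like `violations` would run at ingestion time. A production setup would more likely rely on a full JSON Schema or Avro validator than on this hand-rolled check.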
