Towards a Theory of Data-Diff: Optimal Synthesis of Succinct Data Modification Scripts
Tana Wattanawaroon, Stephen Macke, Aditya Parameswaran

TL;DR
This paper develops a theoretical framework for the Data-Diff problem: finding the shortest sequence of data modification operations, similar to SQL UPDATE, that transforms one version of a dataset into a subsequent version.
Contribution
The paper formally characterizes the Data-Diff problem, classifying its complexity and giving algorithms under different constraints on the attributes and on the allowed conditions and modifiers.
Findings
Complexity classifications for different Data-Diff scenarios
Algorithms for optimal data modification sequences
Insights into the computational limits of data transformation
Abstract
This paper addresses the Data-Diff problem: given a dataset and a subsequent version of the dataset, find the shortest sequence of operations that transforms the dataset to the subsequent version, under a restricted family of operations. We consider operations similar to SQL UPDATE, each with a condition (WHERE) that matches a subset of tuples and a modifier (SET) that makes changes to those matched tuples. We characterize the problem based on different constraints on the attributes and the allowed conditions and modifiers, providing complexity classification and algorithms in each case.
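To make the problem statement concrete, here is a minimal Python sketch of one very restricted operation family. It is an illustration only, not the paper's algorithm. The assumptions are ours: tuples are matched by position between the two versions, a condition is an equality test on a single attribute, a modifier assigns a constant to a single attribute, and the search is plain brute force over scripts of increasing length. The names `apply_op`, `apply_script`, and `shortest_script` are hypothetical.

```python
from itertools import product
from typing import List, Optional, Tuple

Row = Tuple[int, ...]
# Hypothetical encoding of one operation: (where_attr, where_val, set_attr, set_val),
# read as: UPDATE t SET set_attr = set_val WHERE where_attr = where_val
Op = Tuple[int, int, int, int]


def apply_op(rows: List[Row], op: Op) -> List[Row]:
    """Apply one UPDATE-like operation to every row that matches its condition."""
    where_attr, where_val, set_attr, set_val = op
    out = []
    for row in rows:
        if row[where_attr] == where_val:
            # Rebuild the tuple with the modified attribute.
            row = row[:set_attr] + (set_val,) + row[set_attr + 1:]
        out.append(row)
    return out


def apply_script(rows: List[Row], script) -> List[Row]:
    """Apply a sequence of operations in order."""
    for op in script:
        rows = apply_op(rows, op)
    return rows


def shortest_script(source: List[Row], target: List[Row],
                    max_len: int = 3) -> Optional[List[Op]]:
    """Brute-force search for a shortest script, up to max_len operations.

    Assumes source and target are non-empty, have the same length, and that
    the i-th row of source corresponds to the i-th row of target.
    Running time is exponential in max_len; this is only a demonstration.
    """
    n_attrs = len(source[0])
    # Only constants that appear in either version are tried (an assumption).
    values = sorted({v for row in source + target for v in row})
    candidates = [(wa, wv, sa, sv)
                  for wa in range(n_attrs) for wv in values
                  for sa in range(n_attrs) for sv in values]
    # Trying lengths in increasing order means the first hit is a shortest one.
    for length in range(max_len + 1):
        for script in product(candidates, repeat=length):
            if apply_script(source, script) == target:
                return list(script)
    return None
```

For example, with `source = [(1, 0), (1, 0), (2, 0)]` and `target = [(1, 5), (1, 5), (2, 0)]`, a single operation (set attribute 1 to 5 where attribute 0 equals 1) suffices, whereas a naive row-by-row diff would list two changes. The exhaustive search is exponential in the script length, which is why the complexity classification in the paper matters.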
Taxonomy
Topics: Machine Learning and Algorithms · Algorithms and Data Compression · Advanced Database Systems and Queries
