Rule-adhering synthetic data -- the lingua franca of learning
Michael Platzer, Ivona Krchova

TL;DR
This paper introduces a method for creating rule-adhering synthetic data that incorporates domain expertise, serving as a universal learning resource for both humans and machines, demonstrated on a public dataset.
Contribution
It presents a novel approach to integrating domain rules into synthetic data generation, enhancing the data's statistical and domain-specific properties.
Findings
Synthetic data reflects domain rules accurately.
Improved performance in downstream ML tasks.
Enhanced interpretability of synthetic data.
Abstract
AI-generated synthetic data makes it possible to distill the general patterns of existing data, which can then be shared safely as granular-level, representative, yet novel data samples within the original semantics. In this work we explore approaches for incorporating domain expertise into the data synthesis, so that both the statistical properties and pre-existing domain knowledge in the form of rules are represented. The resulting synthetic data generator, which can be probed for any number of new samples, can then serve as a common source of intelligence, a lingua franca of learning, consumable by humans and machines alike. We demonstrate the concept on a publicly available data set and evaluate its benefits via descriptive analysis as well as a downstream ML model.
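The abstract does not say how the rules are enforced during synthesis. As an illustration only, and not the authors' method, one simple way to make a generator rule-adhering is rejection sampling: draw candidate records and keep only those that satisfy every domain rule. The toy data, the rule, and all function names below are hypothetical, and the "generator" is a stand-in that samples each column independently.

```python
import random

# Hypothetical toy "original" data: (age, marital_status) pairs.
ORIGINAL = [
    (34, "married"), (12, "single"), (58, "divorced"),
    (8, "single"), (41, "married"), (25, "single"),
]


def rule_minors_are_single(record):
    """Hypothetical domain rule: anyone under 18 must be 'single'."""
    age, status = record
    return age >= 18 or status == "single"


def naive_sample(rng):
    """Stand-in generator (assumption): draws each column independently
    from the original marginals, so it can emit rule-violating records."""
    age = rng.choice([a for a, _ in ORIGINAL])
    status = rng.choice([s for _, s in ORIGINAL])
    return (age, status)


def rule_adhering_samples(n, rules, seed=0, max_tries=100_000):
    """Rejection sampling: keep drawing until n records pass every rule."""
    rng = random.Random(seed)
    kept = []
    for _ in range(max_tries):
        if len(kept) == n:
            break
        record = naive_sample(rng)
        if all(rule(record) for rule in rules):
            kept.append(record)
    if len(kept) < n:
        raise RuntimeError("could not draw enough rule-adhering samples")
    return kept


synthetic = rule_adhering_samples(50, [rule_minors_are_single])
```

Because the filter is applied at sampling time, the generator can still be probed for any number of new samples, and every returned record respects the rule. Note that rejection changes the sampled distribution relative to the unfiltered generator, which is one reason a real system might instead build the rules into training.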
Taxonomy
Topics: Data Mining Algorithms and Applications · Natural Language Processing Techniques · Data Quality and Management
