Rule-adhering synthetic data -- the lingua franca of learning

Michael Platzer; Ivona Krchova

arXiv:2209.06679 · cs.LG · September 15, 2022

Open Access

TL;DR

This paper introduces a method for creating rule-adhering synthetic data that incorporates domain expertise, serving as a universal learning resource for both humans and machines, demonstrated on a public dataset.

Contribution

It presents a novel approach to integrating domain rules into synthetic data generation, enhancing the data's statistical and domain-specific properties.

Findings

01 Synthetic data reflects domain rules accurately.

02 Improved performance in downstream ML tasks.

03 Enhanced interpretability of synthetic data.

Abstract

AI-generated synthetic data makes it possible to distill the general patterns of existing data, which can then be shared safely as granular-level, representative, yet novel data samples within the original semantics. In this work we explore approaches for incorporating domain expertise into the data synthesis, so that both the statistical properties and pre-existing domain knowledge, expressed as rules, are represented. The resulting synthetic data generator, which can be probed for any number of new samples, can then serve as a common source of intelligence, a lingua franca of learning, consumable by humans and machines alike. We demonstrate the concept on a publicly available data set and evaluate its benefits via descriptive analysis as well as a downstream ML model.
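The abstract does not spell out how rules are enforced, so the following is only a minimal sketch of one generic way to obtain rule-adhering samples: rejection sampling, where generated records that violate a rule are discarded. It is not presented as the authors' method. The rule itself, the field names, and the functions `toy_generator`, `satisfies_rules`, and `sample_rule_adhering` are all hypothetical and exist purely for illustration; `toy_generator` stands in for a trained synthetic data generator.

```python
import random


def satisfies_rules(record):
    """Hypothetical domain rule (not from the paper):
    a record marked 'married' must have age >= 18."""
    return not (record["marital_status"] == "married" and record["age"] < 18)


def toy_generator(rng):
    """Toy stand-in for a trained synthetic data generator (an assumption)."""
    return {
        "age": rng.randint(0, 90),
        "marital_status": rng.choice(["single", "married"]),
    }


def sample_rule_adhering(n, seed=0, max_tries=100_000):
    """Draw n synthetic records, rejecting any that violate the rules."""
    rng = random.Random(seed)
    accepted = []
    tries = 0
    while len(accepted) < n:
        if tries >= max_tries:
            raise RuntimeError("rules too restrictive for rejection sampling")
        tries += 1
        record = toy_generator(rng)
        if satisfies_rules(record):
            accepted.append(record)
    return accepted


# Probe the generator for any number of new, rule-adhering samples.
samples = sample_rule_adhering(1000)
```

Rejection sampling is easy to reason about because every returned record passes the rule check by construction, but it becomes wasteful when rules exclude most of what the generator produces; in that case, teaching the rules to the generator itself is the more natural design.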


Taxonomy

Topics: Data Mining Algorithms and Applications · Natural Language Processing Techniques · Data Quality and Management