Dataset Generation Patterns for Evaluating Knowledge Graph Construction

Markus Schröder; Christian Jilek; Andreas Dengel

arXiv:2104.13576·cs.DB·August 2, 2021

TL;DR

This paper introduces Data Sprout, a generator that synthetically creates realistic datasets for evaluating knowledge graph construction, based on patterns observed in real industry spreadsheets.

Contribution

The paper identifies 11 data generation patterns in real industry spreadsheets and implements them in Data Sprout to produce authentic-looking synthetic datasets.

Findings

01. Data Sprout successfully reproduces real spreadsheet patterns.

02. Synthetic datasets mimic real data for evaluation purposes.

03. Patterns improve the realism of generated data.

Abstract

Confidentiality hinders the publication of authentic, labeled datasets of personal and enterprise data, although they could be useful for evaluating knowledge graph construction approaches in industrial scenarios. Therefore, our plan is to synthetically generate such data in a way that it appears as authentic as possible. Based on our assumption that knowledge workers have certain habits when they produce or manage data, generation patterns could be discovered which can be utilized by data generators to imitate real datasets. In this paper, we initially derived 11 distinct patterns found in real spreadsheets from industry and demonstrate a suitable generator called Data Sprout that is able to reproduce them. We describe how the generator produces spreadsheets in general and what altering effects the implemented patterns have.
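
The general idea in the abstract, a generator that first produces a clean table and then lets patterns alter it, can be illustrated with a small sketch. This is not Data Sprout's actual code or API. The function names, the column names, and the two patterns shown here (abbreviating a value, merging two facts into one cell) are hypothetical stand-ins chosen for illustration and are not taken from the paper's list of 11 patterns.

```python
import random

# Hypothetical vocabulary; real generators would draw from richer sources.
DEPARTMENTS = ["Sales", "Research", "Support"]
NAMES = ["Anna Meyer", "Ben Koch", "Clara Wolf", "David Lang"]


def generate_clean_table(n_rows, seed=0):
    """Produce a tidy table that can serve as labeled ground truth."""
    rng = random.Random(seed)
    return [
        {
            "name": rng.choice(NAMES),
            "department": rng.choice(DEPARTMENTS),
            "year": str(rng.randint(2015, 2021)),
        }
        for _ in range(n_rows)
    ]


def abbreviate_department(rows, rng):
    """Hypothetical pattern: shorten some department names to three letters."""
    out = []
    for row in rows:
        row = dict(row)  # copy so the ground truth stays untouched
        if rng.random() < 0.5:
            row["department"] = row["department"][:3].upper()
        out.append(row)
    return out


def merge_name_and_year(rows, rng):
    """Hypothetical pattern: squeeze two facts into a single cell."""
    out = []
    for row in rows:
        row = dict(row)
        row["name"] = f'{row["name"]} ({row.pop("year")})'
        out.append(row)
    return out


def apply_patterns(clean_rows, patterns, seed=0):
    """Apply each pattern in order to a copy of the clean table."""
    rng = random.Random(seed)
    messy = [dict(r) for r in clean_rows]
    for pattern in patterns:
        messy = pattern(messy, rng)
    return messy


clean = generate_clean_table(5, seed=42)
messy = apply_patterns(clean, [abbreviate_department, merge_name_and_year], seed=42)
for c, m in zip(clean, messy):
    print(c, "->", m)
```

Keeping the clean table alongside the altered one is what makes such a dataset usable for evaluation in this sketch: a knowledge graph construction approach sees only the altered rows, and its output can be compared against the clean ones.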

Code & Models

Repositories

mschroeder-github/datasprout (Official)
