Towards Scalable Generation of Realistic Test Data for Duplicate Detection

Fabian Panse; Wolfram Wingerath; Benjamin Wollmer

arXiv:2312.17324·cs.DB·January 1, 2024·1 cites

Open Access

TL;DR

This paper introduces a new approach for generating large, realistic test datasets with complex schemas and error patterns to improve duplicate detection testing.

Contribution

It presents a scalable test data generator capable of producing complex, realistic datasets, addressing limitations of existing small-scale solutions.

Findings

01 Enables generation of large, complex test datasets

02 Produces more realistic error patterns

03 Easy to use for inexperienced users

Abstract

Due to the increasing volume, volatility, and diversity of data in virtually all areas of our lives, the ability to detect duplicates in potentially linked data sources is more important than ever before. However, while research is already intensively engaged in adapting duplicate detection algorithms to the changing circumstances, existing test data generators are still designed for small, mostly relational datasets and can thus fulfill their intended task only to a limited extent. In this report, we present our ongoing research on a novel approach for test data generation that, in contrast to existing solutions, is able to produce large test datasets with complex schemas and more realistic error patterns while being easy to use for inexperienced users.
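To make the notion of test data for duplicate detection concrete, the following is a minimal sketch and not the generator described in the paper. It assumes a flat list of records with string fields, a single hypothetical error type (swapping two adjacent characters), and hypothetical names `inject_typo` and `generate_test_data`. It shows the two outputs such a generator typically needs to provide: a polluted dataset and the gold standard of true duplicate pairs against which a detection algorithm can be scored.

```python
import random


def inject_typo(value: str, rng: random.Random) -> str:
    """Swap two adjacent characters as a simple stand-in for a typing error."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    chars = list(value)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def generate_test_data(clean_records, duplicate_rate, seed=0):
    """Return (records, gold_pairs).

    records    -- the clean records followed by erroneous duplicates
    gold_pairs -- (original_index, duplicate_index) for every injected duplicate
    """
    rng = random.Random(seed)
    records = [dict(r) for r in clean_records]
    gold_pairs = []
    for original_id, record in enumerate(clean_records):
        # Decide per record whether to emit a duplicate (assumed error model).
        if rng.random() < duplicate_rate:
            duplicate = dict(record)
            # Pollute exactly one field, chosen at random.
            field = rng.choice(sorted(duplicate))
            duplicate[field] = inject_typo(duplicate[field], rng)
            records.append(duplicate)
            gold_pairs.append((original_id, len(records) - 1))
    return records, gold_pairs


if __name__ == "__main__":
    clean = [
        {"name": "Alice Smith", "city": "Hamburg"},
        {"name": "Bob Jones", "city": "Berlin"},
    ]
    records, gold_pairs = generate_test_data(clean, duplicate_rate=1.0, seed=1)
    for a, b in gold_pairs:
        print(records[a], "<->", records[b])
```

A toy like this covers only flat records and one error type; the large volumes, complex schemas, and more realistic error patterns named in the abstract are exactly what it leaves out.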

Taxonomy

Topics: Data Quality and Management · Web Data Mining and Analysis · Data-Driven Disease Surveillance