Towards Scalable Generation of Realistic Test Data for Duplicate Detection
Fabian Panse, Wolfram Wingerath, Benjamin Wollmer

TL;DR
This paper introduces a new approach for generating large, realistic test datasets with complex schemas and error patterns to improve duplicate detection testing.
Contribution
It presents a scalable test data generator capable of producing complex, realistic datasets, addressing limitations of existing small-scale solutions.
Findings
Enables generation of large, complex test datasets
Produces more realistic error patterns
Easy to use for inexperienced users
Abstract
Due to the increasing volume, volatility, and diversity of data in virtually all areas of our lives, the ability to detect duplicates in potentially linked data sources is more important than ever before. However, while research is already intensively engaged in adapting duplicate detection algorithms to the changing circumstances, existing test data generators are still designed for small -- mostly relational -- datasets and can thus fulfill their intended task only to a limited extent. In this report, we present our ongoing research on a novel approach for test data generation that -- in contrast to existing solutions -- is able to produce large test datasets with complex schemas and more realistic error patterns while being easy to use for inexperienced users.
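As a rough illustration of what test data generation for duplicate detection involves, the sketch below copies clean records, injects a simple error into each copy, and records which pairs are true duplicates. It is not the authors' generator: the function names (`inject_typo`, `generate_test_data`), the `dup_rate` parameter, the flat dictionary record layout, and the adjacent-character-swap error model are all assumptions made for illustration. They are far simpler than the complex schemas and realistic error patterns the abstract describes.

```python
import random


def inject_typo(value: str, rng: random.Random) -> str:
    """Swap two adjacent characters (a deliberately simple, hypothetical error model)."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    return value[:i] + value[i + 1] + value[i] + value[i + 2:]


def generate_test_data(clean_records, dup_rate=0.5, seed=0):
    """Return (records, gold_pairs).

    records    -- the clean records plus erroneous duplicates
    gold_pairs -- (original_id, duplicate_id) tuples forming the gold standard
    """
    rng = random.Random(seed)  # seeded so the generated dataset is reproducible
    records = [dict(rec) for rec in clean_records]
    gold_pairs = []
    next_id = max((rec["id"] for rec in records), default=0) + 1

    for rec in clean_records:
        if rng.random() < dup_rate:
            dup = dict(rec)
            dup["id"] = next_id
            # Corrupt exactly one non-id field of the copy.
            field = rng.choice([key for key in rec if key != "id"])
            dup[field] = inject_typo(str(rec[field]), rng)
            records.append(dup)
            gold_pairs.append((rec["id"], next_id))
            next_id += 1

    return records, gold_pairs


if __name__ == "__main__":
    clean = [
        {"id": 1, "name": "Alice Smith", "city": "Hamburg"},
        {"id": 2, "name": "Bob Jones", "city": "Berlin"},
    ]
    records, gold = generate_test_data(clean, dup_rate=1.0, seed=42)
    for rec in records:
        print(rec)
    print("gold standard:", gold)
```

A duplicate detection algorithm can then be run on `records` and its output compared against `gold_pairs` to compute precision and recall.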
Taxonomy
Topics: Data Quality and Management · Web Data Mining and Analysis · Data-Driven Disease Surveillance
