Data Partitioning for Parallel Entity Matching
Toralf Kirsten, Lars Kolb, Michael Hartung, Anika Groß, Hanna Köpcke, Erhard Rahm

TL;DR
This paper explores data partitioning strategies to enable efficient parallel entity matching on distributed systems, aiming to reduce execution time and improve scalability for web data integration.
Contribution
It introduces novel data partitioning strategies supporting blocking and parallel matching, along with a distributed infrastructure for scalable entity matching.
Findings
Partitioning impacts communication overhead and memory use.
Blocking combined with parallel matching improves efficiency.
Caching and affinity scheduling enhance performance.
Abstract
Entity matching is an important and difficult step for integrating web data. To reduce the typically high execution time for matching, we investigate how we can perform entity matching in parallel on a distributed infrastructure. We propose different strategies to partition the input data and generate multiple match tasks that can be independently executed. One of our strategies supports both blocking, to reduce the search space for matching, and parallel matching, to improve efficiency. Special attention is given to the number and size of data partitions, as they impact the overall communication overhead and memory requirements of individual match tasks. We have developed a service-based distributed infrastructure for the parallel execution of match workflows. We evaluate our approach in detail for different match strategies for matching real-world product data of different web shops. We…
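The partitioning idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `brand` blocking key, and the sample records are all hypothetical, and it leaves out the tuning of partition number and size that the abstract highlights. It only shows how partitioned input yields match tasks that can run independently, and how blocking shrinks the set of tasks.

```python
from collections import defaultdict
from itertools import combinations_with_replacement


def size_based_partitions(entities, max_size):
    """Split the input into consecutive partitions of at most max_size entities."""
    if max_size < 1:
        raise ValueError("max_size must be positive")
    return [entities[i:i + max_size] for i in range(0, len(entities), max_size)]


def blocking_partitions(entities, blocking_key):
    """Group entities that share a blocking key value into one partition each."""
    blocks = defaultdict(list)
    for entity in entities:
        blocks[blocking_key(entity)].append(entity)
    return list(blocks.values())


def cartesian_match_tasks(partitions):
    """Without blocking, compare every partition with itself and with every other one."""
    return list(combinations_with_replacement(range(len(partitions)), 2))


def blocked_match_tasks(partitions):
    """With blocking, only entities inside the same block are compared."""
    return [(i, i) for i in range(len(partitions))]


# Hypothetical sample data, invented for illustration only.
products = [
    {"id": 1, "brand": "acme", "title": "Acme Phone X"},
    {"id": 2, "brand": "acme", "title": "ACME PhoneX 64GB"},
    {"id": 3, "brand": "zenith", "title": "Zenith Tab 10"},
    {"id": 4, "brand": "zenith", "title": "Zenith Tablet 10 inch"},
    {"id": 5, "brand": "acme", "title": "Acme Charger"},
]

size_parts = size_based_partitions(products, max_size=2)
size_tasks = cartesian_match_tasks(size_parts)

block_parts = blocking_partitions(products, blocking_key=lambda p: p["brand"])
block_tasks = blocked_match_tasks(block_parts)

print(len(size_parts), "size-based partitions ->", len(size_tasks), "match tasks")
print(len(block_parts), "blocks ->", len(block_tasks), "match tasks")
```

Each task is a pair of partition indices, so a worker only needs to load the two partitions it names. Smaller partitions lower the memory needed per task but increase the number of tasks and the data shipped to workers, which is the trade-off the abstract points to.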
Taxonomy
Topics: Data Quality and Management · Advanced Database Systems and Queries · Service-Oriented Architecture and Web Services
