CBLOCK: An Automatic Blocking Mechanism for Large-Scale De-duplication Tasks
Anish Das Sarma, Ankur Jain, Ashwin Machanavajjhala, Philip Bohannon

TL;DR
CBLOCK is an automated system that learns hash functions for large-scale, multi-schema de-duplication tasks, improving efficiency and recall through hierarchical blocking and post-processing in distributed environments.
Contribution
It introduces a novel automated blocking method that learns hash functions from data and schemas, adaptable to various constraints and scalable for web-scale datasets.
Findings
Successfully applied to large Yahoo datasets with over 140K movies and 40K restaurants.
Demonstrates improved recall and efficiency in de-duplication tasks.
Supports hierarchical blocking and post-processing to optimize performance.
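As a rough illustration of what hierarchical blocking can look like, the sketch below splits records with a first hash function and re-splits any block that is still too large with the next one. It is a minimal sketch under stated assumptions, not the CBLOCK algorithm: the function name `hierarchical_blocks`, the `max_size` parameter, the hand-written hash functions, and the sample records are all hypothetical, whereas the paper describes learning hash functions from data.

```python
from collections import defaultdict


def hierarchical_blocks(records, hash_fns, max_size):
    """Split records with hash_fns[0]; re-split any block larger than
    max_size using the remaining hash functions (hypothetical helper)."""
    # Stop when the block is small enough or no hash functions remain.
    if len(records) <= max_size or not hash_fns:
        return [sorted(records)]
    first, rest = hash_fns[0], hash_fns[1:]
    groups = defaultdict(dict)
    for rec_id, rec in records.items():
        groups[first(rec)][rec_id] = rec
    blocks = []
    for group in groups.values():
        blocks.extend(hierarchical_blocks(group, rest, max_size))
    return blocks


# Hypothetical restaurant records, keyed by record id.
records = {
    1: {"city": "sf", "zip": "94103"},
    2: {"city": "sf", "zip": "94103"},
    3: {"city": "sf", "zip": "94110"},
    4: {"city": "nyc", "zip": "10001"},
}
# Hand-picked hash functions, coarse to fine (an assumption for this sketch).
hash_fns = [lambda r: r["city"], lambda r: r["zip"]]

blocks = hierarchical_blocks(records, hash_fns, max_size=2)
print(blocks)
```

Here the "sf" block holds three records, which exceeds `max_size`, so only that block is split further by zip code; the "nyc" block is left alone.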
Abstract
De-duplication, the identification of distinct records referring to the same real-world entity, is a well-known challenge in data integration. Since very large datasets prohibit the comparison of every pair of records, blocking has been identified as a technique of dividing the dataset for pairwise comparisons, thereby trading off recall of identified duplicates for efficiency. Traditional de-duplication tasks, while challenging, typically involved a fixed schema such as Census data or medical records. However, with the presence of large, diverse sets of structured data on the web and the need to organize it effectively on content portals, de-duplication systems need to scale in a new dimension to handle a large number of schemas, tasks and data sets, while handling ever larger problem sizes. In addition, when working in a map-reduce framework it is important that canopy…
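The trade-off between recall and efficiency that blocking introduces can be seen in a small Python sketch. This is an illustration only: the helper names `block_records` and `candidate_pairs`, the two hand-written hash functions, the sample movie records, and the labeled duplicate pair are all assumptions made for the example and do not come from the paper.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, hash_fn):
    """Group record ids by the key that hash_fn assigns to each record."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[hash_fn(rec)].append(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Only records that share a block are compared pairwise."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical movie records; records 1 and 2 are the same real-world movie.
records = {
    1: {"title": "The Matrix", "year": 1999},
    2: {"title": "Matrix, The", "year": 1999},
    3: {"title": "The Matrix Reloaded", "year": 2003},
    4: {"title": "Heat", "year": 1995},
}
true_duplicates = {(1, 2)}

# Two candidate hash functions (assumptions for this sketch).
by_year = block_records(records, lambda r: r["year"])
by_first_letter = block_records(records, lambda r: r["title"][0].lower())

all_pairs = set(combinations(sorted(records), 2))
pairs_year = candidate_pairs(by_year)
pairs_letter = candidate_pairs(by_first_letter)

# Recall: fraction of true duplicate pairs that survive blocking.
recall_year = len(true_duplicates & pairs_year) / len(true_duplicates)
recall_letter = len(true_duplicates & pairs_letter) / len(true_duplicates)

print(len(all_pairs), len(pairs_year), recall_year)
print(len(all_pairs), len(pairs_letter), recall_letter)
```

Both hash functions cut the number of comparisons from six pairs to one, but only blocking by year keeps the true duplicate pair together. Choosing hash functions that preserve recall is the kind of decision the paper proposes to automate.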
Taxonomy
Topics: Data Quality and Management · Advanced Database Systems and Queries · Cloud Data Security Solutions
