Scaling up Copy Detection
Xian Li, Xin Luna Dong, Kenneth B. Lyons, Weiyi Meng, and Divesh Srivastava

TL;DR
This paper presents a scalable algorithm for copy detection in structured data that significantly reduces computation time, enabling more efficient truth finding in large data sources.
Contribution
The authors introduce an inverted index-based algorithm with pruning and sampling strategies that greatly improve copy detection scalability on structured data.
Findings
Reduces copy detection time by two to three orders of magnitude.
Enables faster truth finding by integrating scalable copy detection.
Maintains high accuracy with sampling strategies.
Abstract
Recent research shows that copying is prevalent for Deep-Web data and considering copying can significantly improve truth finding from conflicting values. However, existing copy detection techniques do not scale for large sizes and numbers of data sources, so truth finding can be slowed down by one to two orders of magnitude compared with the corresponding techniques that do not consider copying. In this paper, we study how to improve scalability of copy detection on structured data. Our algorithm builds an inverted index for each shared value and processes the index entries in decreasing order of how much the shared value can contribute to the conclusion of copying. We show how we use the index to prune the data items we consider for each pair of sources, and to incrementally refine our results in iterative copy detection. We also apply a sampling strategy with which we…
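As a rough illustration of the indexing idea described in the abstract, the sketch below builds an inverted index over shared values and visits its entries in decreasing order of a contribution score. It is not the authors' algorithm: the function names, the toy data, and the scoring rule (values shared by fewer sources count more) are all assumptions made for this example, and the pruning, iterative refinement, and sampling steps are omitted.

```python
from collections import defaultdict
from itertools import combinations
import math


def build_shared_index(sources):
    """Map each (item, value) to the sources providing it; keep shared entries only."""
    index = defaultdict(set)
    for name, claims in sources.items():
        for item, value in claims.items():
            index[(item, value)].add(name)
    # An entry is "shared" when at least two sources provide the same value.
    return {key: provs for key, provs in index.items() if len(provs) >= 2}


def entry_score(providers, n_sources):
    """Hypothetical score (an assumption): rarer shared values contribute more."""
    return -math.log(len(providers) / n_sources)


def accumulate_pair_scores(sources):
    """Visit index entries from highest to lowest score, summing per source pair."""
    index = build_shared_index(sources)
    n = len(sources)
    ordered = sorted(index.items(),
                     key=lambda kv: entry_score(kv[1], n),
                     reverse=True)
    pair_scores = defaultdict(float)
    for _, providers in ordered:
        score = entry_score(providers, n)
        for a, b in combinations(sorted(providers), 2):
            pair_scores[(a, b)] += score
    return dict(pair_scores)


# Toy data (invented for illustration): four sources, two data items.
sources = {
    "S1": {"capital_nj": "Trenton", "capital_az": "Phoenix"},
    "S2": {"capital_nj": "Trenton", "capital_az": "Tempe"},
    "S3": {"capital_nj": "Trenton", "capital_az": "Tempe"},
    "S4": {"capital_nj": "Trenton", "capital_az": "Phoenix"},
}
pair_scores = accumulate_pair_scores(sources)
```

In this toy setting, a value provided by every source adds nothing to any pair, while a value shared by only two sources raises that pair's score. Visiting entries in decreasing score order is what would let a fuller implementation stop early for a pair once the remaining entries cannot change the decision; the sketch only accumulates scores and does not make that decision.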
Taxonomy
Topics: Web Data Mining and Analysis · Data Quality and Management · Scientific Computing and Data Management
