Scaling up Copy Detection

Xian Li; Xin Luna Dong; Kenneth B. Lyons; Weiyi Meng; Divesh Srivastava

arXiv:1503.00309 · cs.DB · March 3, 2015 · 1 cites

Open Access

TL;DR

This paper presents a scalable algorithm for copy detection in structured data that significantly reduces computation time, enabling more efficient truth finding in large data sources.

Contribution

The authors introduce an inverted index-based algorithm with pruning and sampling strategies that greatly improve copy detection scalability on structured data.

Findings

01. Reduces copy detection time by two to three orders of magnitude.

02. Enables faster truth finding by integrating scalable copy detection.

03. Maintains high accuracy with sampling strategies.

Abstract

Recent research shows that copying is prevalent for Deep-Web data and considering copying can significantly improve truth finding from conflicting values. However, existing copy detection techniques do not scale for large sizes and numbers of data sources, so truth finding can be slowed down by one to two orders of magnitude compared with the corresponding techniques that do not consider copying. In this paper, we study how to improve scalability of copy detection on structured data. Our algorithm builds an inverted index for each shared value and processes the index entries in decreasing order of how much the shared value can contribute to the conclusion of copying. We show how we use the index to prune the data items we consider for each pair of sources, and to incrementally refine our results in iterative copy detection. We also apply a sampling strategy with which we…
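
The index-and-prune idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not give the scoring formula, so `rarity_score` (rarer shared values count for more), the `threshold` parameter, the early-stop rule, and all function names here are hypothetical stand-ins chosen only to show the shape of the approach. The sketch builds one index entry per value shared by two or more sources, visits entries in decreasing score order, and stops accumulating evidence for a source pair once that pair crosses the threshold.

```python
from collections import defaultdict
from itertools import combinations
from math import log


def build_shared_index(claims):
    """Map (item, value) -> sorted sources, keeping only values given by 2+ sources."""
    index = defaultdict(set)
    for source, items in claims.items():
        for item, value in items.items():
            index[(item, value)].add(source)
    return {key: sorted(srcs) for key, srcs in index.items() if len(srcs) >= 2}


def rarity_score(providers, n_sources):
    """Hypothetical contribution score: values shared by fewer sources weigh more."""
    return log(n_sources / len(providers))


def detect_copying(claims, threshold):
    """Return source pairs whose accumulated shared-value evidence reaches threshold."""
    n_sources = len(claims)
    index = build_shared_index(claims)
    # Visit index entries from the most to the least informative shared value.
    entries = sorted(
        index.items(),
        key=lambda entry: rarity_score(entry[1], n_sources),
        reverse=True,
    )
    evidence = defaultdict(float)
    flagged = set()
    for _, providers in entries:
        score = rarity_score(providers, n_sources)
        for pair in combinations(providers, 2):
            if pair in flagged:
                continue  # pair already decided; skip its remaining shared values
            evidence[pair] += score
            if evidence[pair] >= threshold:
                flagged.add(pair)
    return flagged


# Toy data (made up for illustration): four sources, four data items.
claims = {
    "A": {"w": 5, "x": 1, "y": 2, "z": 3},
    "B": {"w": 5, "x": 1, "y": 2, "z": 9},
    "C": {"w": 6, "x": 1, "y": 7, "z": 3},
    "D": {"w": 7, "x": 1, "y": 8, "z": 4},
}
suspects = detect_copying(claims, threshold=1.0)
```

On this toy data, `suspects` contains only the pair `("A", "B")`: the two sources share two values that no other source provides, while the value every source agrees on contributes nothing under this stand-in score.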

Taxonomy

Topics: Web Data Mining and Analysis · Data Quality and Management · Scientific Computing and Data Management