Duplicate Detection with Efficient Language Models for Automatic Bibliographic Heterogeneous Data Integration

Nicolas Turenne

arXiv:1504.07597·cs.DB·April 29, 2015

TL;DR

This paper introduces a duplicate detection method for merging heterogeneous bibliographic databases using lexical and social cues, reporting 95% recall and 100% precision in identifying duplicate records.

Contribution

The paper proposes a novel duplicate detection approach based on n-gram key fingerprints, improving precision and recall in bibliographic data integration.

Findings

01

Achieved 95% recall in duplicate detection

02

Achieved 100% precision in duplicate detection

03

Outperformed existing deduplication methods
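
The recall and precision figures above follow the usual definitions over sets of duplicate pairs: precision is the share of predicted pairs that are true duplicates, and recall is the share of true duplicates that were predicted. The sketch below shows that arithmetic only. The function name `precision_recall` and the pair data are hypothetical, chosen so the result matches the reported percentages; they are not the paper's gold standards.

```python
from typing import Set, Tuple

Pair = Tuple[int, int]  # (record id A, record id B), hypothetical ids


def precision_recall(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float]:
    """Return (precision, recall) of predicted duplicate pairs against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: 20 true duplicate pairs, 19 of them found, no false alarms.
gold_pairs = {(i, i + 100) for i in range(20)}
predicted_pairs = set(sorted(gold_pairs)[:19])

precision, recall = precision_recall(predicted_pairs, gold_pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these made-up counts, every predicted pair is correct and one true pair out of twenty is missed, which is one way a 100% precision and 95% recall result can arise.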

Abstract

We present a new duplicate detection method for merging different bibliographic record corpora with the help of lexical and social information. As we show, no trivial key is available for deleting redundant documents. Merging heterogeneous document databases is of interest when the goal is to gather as much information as possible. In our case, we build a document corpus about the TOR molecule in order to extract relationships with other gene components from the PubMed and Web of Science document databases. Our approach builds key fingerprints based on n-grams. We built two document gold standards from this corpus for evaluation. Compared with other well-known deduplication methods, our approach gives the best recall (95%) and precision (100%).
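
The abstract does not spell out how the n-gram key fingerprints are built, so the following is only a minimal sketch of the general idea, assuming character trigrams over a normalized title. The names `normalize`, `ngrams`, `fingerprint`, and `jaccard`, the choice of n = 3, and the sample titles are all assumptions made for illustration, not the author's implementation.

```python
import re
import unicodedata
from typing import Set


def normalize(text: str) -> str:
    """Lowercase, strip accents, and keep only ASCII letters and digits."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]", "", ascii_only.lower())


def ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of the normalized text."""
    s = normalize(text)
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def fingerprint(text: str, n: int = 3) -> str:
    """Build an order-independent key from the sorted unique n-grams."""
    return "".join(sorted(ngrams(text, n)))


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of two n-gram sets; 0.0 when either text is empty."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical titles for the same record as exported by two databases.
title_a = "TOR signalling in yeast."
title_b = "Tor Signalling in Yeast"

same_key = fingerprint(title_a) == fingerprint(title_b)
print(same_key, jaccard(title_a, title_b))
```

Identical fingerprints catch records that differ only in case, punctuation, or spacing; a similarity threshold on the n-gram overlap is one possible way to catch near matches, at the cost of having to tune that threshold against a gold standard.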

Taxonomy

Topics: Data Quality and Management · Web Data Mining and Analysis · Natural Language Processing Techniques