Duplicate Detection with Efficient Language Models for Automatic Bibliographic Heterogeneous Data Integration
Nicolas Turenne

TL;DR
This paper introduces an efficient duplicate detection method for merging heterogeneous bibliographic databases using lexical and social cues, achieving high accuracy in identifying duplicate records.
Contribution
The paper proposes a novel duplicate detection approach based on n-gram key fingerprints, improving precision and recall in bibliographic data integration.
Findings
Achieved 95% recall in duplicate detection
Achieved 100% precision in duplicate detection
Outperformed the other well-known deduplication methods it was compared against
Abstract
We present a new method for detecting duplicates when merging different bibliographic record corpora, using lexical and social information. As we show, no trivial key is available for deleting redundant documents. Merging heterogeneous document databases can be worthwhile to obtain as much information as possible. In our case, we build a document corpus about the TOR molecule from the PubMed and Web of Science document databases in order to extract relationships with other gene components. Our approach builds key fingerprints based on n-grams. We constructed two document gold standards from this corpus for evaluation. Compared with other well-known deduplication methods, our approach achieves the best recall (95%) and precision (100%).
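To make the idea of an n-gram key fingerprint concrete, here is a minimal sketch of a generic variant of the technique: normalize a title, collect its character n-grams, then sort and de-duplicate them into a key. This is not the paper's exact procedure. The function names `ngram_fingerprint` and `group_duplicates`, the choice of n = 2, the normalization steps, and the sample records are all assumptions made for illustration, and the sketch uses only the lexical cue (titles), not the social information mentioned in the abstract.

```python
import re
import unicodedata
from collections import defaultdict


def ngram_fingerprint(text: str, n: int = 2) -> str:
    """Build a key from the sorted, de-duplicated character n-grams of text.

    Hypothetical helper: the normalization below is an assumption,
    not the procedure described in the paper.
    """
    # Strip accents, lowercase, and keep only ASCII letters and digits.
    normalized = unicodedata.normalize("NFKD", text)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii").lower()
    cleaned = re.sub(r"[^a-z0-9]", "", ascii_only)
    # Collect every character n-gram once, then sort for a stable key.
    grams = {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}
    return "".join(sorted(grams))


def group_duplicates(records, n: int = 2):
    """Group record ids whose titles share the same fingerprint.

    `records` is an iterable of (record_id, title) pairs; only groups
    with more than one member are returned as duplicate candidates.
    """
    groups = defaultdict(list)
    for record_id, title in records:
        groups[ngram_fingerprint(title, n)].append(record_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Invented sample records standing in for PubMed and Web of Science entries.
sample = [
    ("pubmed-1", "TOR signalling in yeast."),
    ("wos-1", "TOR Signalling in Yeast"),
    ("pubmed-2", "An unrelated record"),
]
duplicates = group_duplicates(sample)
```

With these sample records, the two TOR titles differ only in case and punctuation, so they collapse to the same key and are reported as one candidate group. Because the key is a sorted set of n-grams, it also ignores word order, which can produce false matches; a real pipeline would likely confirm candidates with additional fields.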
Taxonomy
Topics: Data Quality and Management · Web Data Mining and Analysis · Natural Language Processing Techniques
