Duplicate Detection with GenAI

Ian Ormesher

arXiv:2406.15483 · cs.CL · June 25, 2024

Open Access

TL;DR

This paper demonstrates how leveraging Large Language Models and Generative AI significantly enhances the accuracy of detecting and repairing duplicate customer records in CRMs, doubling the success rate compared to traditional NLP methods.

Contribution

The paper introduces a novel approach using GenAI for duplicate detection in CRMs, achieving nearly double the accuracy of existing NLP-based techniques.

Findings

01. De-duplication accuracy improved from 30% to 60%.

02. GenAI approach outperforms traditional NLP methods.

03. Benchmark datasets validate the effectiveness of the proposed method.

Abstract

Customer data is often stored as records in Customer Relationship Management systems (CRMs). Data that is manually entered into such systems by one or more users over time leads to data replication, partial duplication or fuzzy duplication. This in turn means that there is no longer a single source of truth for customers, contacts, accounts, etc. Downstream business processes become increasingly complex and contrived without a unique mapping between a record in a CRM and the target customer. Current methods to detect and de-duplicate records use traditional Natural Language Processing techniques known as Entity Matching. In this paper we show how using the latest advancements in Large Language Models and Generative AI can vastly improve the identification and repair of duplicated records. On common benchmark datasets we find an improvement in the accuracy of data de-duplication rates from 30…
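To make the notion of fuzzy duplication concrete, the following is a minimal sketch of a classical string-similarity baseline for flagging candidate duplicate records. It is not the paper's method: the field names (`name`, `email`), the helper functions, the sample records, and the 0.85 threshold are all illustrative assumptions, and the similarity measure is simply Python's standard-library `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical field names; a real CRM schema will differ.
FIELDS = ("name", "email")


def normalize(record):
    """Join the compared fields, lowercase them, and collapse whitespace."""
    joined = " ".join(str(record[field]) for field in FIELDS)
    return " ".join(joined.lower().split())


def similarity(a, b):
    """Character-level similarity ratio in [0, 1] between two records."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def candidate_duplicates(records, threshold=0.85):
    """Return (i, j, score) for every record pair scoring at or above threshold.

    The threshold is an assumed value, not one taken from the paper.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((i, j, round(score, 3)))
    return pairs


# Made-up sample data for illustration only.
records = [
    {"name": "Jane Smith", "email": "jane.smith@example.com"},
    {"name": "Jane  Smith", "email": "Jane.Smith@example.com"},  # differs only in case and spacing
    {"name": "Jayne Smith", "email": "jane.smith@example.com"},  # fuzzy variant of the name
    {"name": "Robert Jones", "email": "rjones@example.org"},     # unrelated customer
]

for i, j, score in candidate_duplicates(records):
    print(f"records {i} and {j} look like duplicates (score {score})")
```

With this sample data the three "Smith" records are flagged as pairwise candidates and the unrelated record is not. The sketch compares every pair, which scales quadratically with the number of records, so practical systems typically add a blocking step before comparison.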

Taxonomy

Topics: Computational and Text Analysis Methods · Digital Media Forensic Detection · Machine Learning and Data Classification