Supervised machine learning techniques for data matching based on similarity metrics
Pim Verschuuren, Serena Palazzo, Tom Powell, Steve Sutton, Alfred Pilgrim, Michele Faucci Giannelli

TL;DR
This paper explores the application of supervised machine learning techniques, specifically neural networks and decision trees, combined with string similarity metrics, to improve data matching accuracy and efficiency in invoice datasets.
Contribution
It introduces a novel approach integrating machine learning with string similarity functions for data matching and compares its performance against existing solutions.
Findings
Neural network and decision tree models outperform existing deduplication solutions.
The approach reduces pair dimensionality effectively.
Models achieve comparable or better accuracy in invoice data matching.
Abstract
Businesses, governmental bodies and NGOs have an ever-increasing amount of data at their disposal from which they try to extract valuable information. Often, this needs to be done not only accurately but also within a short time frame. Clean and consistent data is therefore crucial. Data matching is the field that tries to identify instances in data that refer to the same real-world entity. In this study, machine learning techniques are combined with string similarity functions and applied to the field of data matching. A dataset of invoices from a variety of businesses and organizations was preprocessed with a grouping scheme to reduce pair dimensionality, and a set of similarity functions was used to quantify similarity between invoice pairs. The resulting invoice pair dataset was then used to train and validate a neural network and a boosted decision tree. The performance was compared with a…
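The pipeline described in the abstract (group invoices to limit the number of pairs, then score each remaining pair with string similarity functions) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration, not the authors' method: the invoice fields (`supplier`, `reference`, `amount`), the grouping key (equal amount), and the use of `difflib.SequenceMatcher` as a stand-in for the paper's similarity functions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical invoice records; the field names are illustrative only.
invoices = [
    {"id": 1, "supplier": "Acme Ltd", "reference": "INV-1001", "amount": 250.00},
    {"id": 2, "supplier": "ACME Limited", "reference": "INV1001", "amount": 250.00},
    {"id": 3, "supplier": "Borealis GmbH", "reference": "B-77", "amount": 90.50},
    {"id": 4, "supplier": "Acme Ltd", "reference": "INV-2040", "amount": 250.00},
]

def group_key(invoice):
    # Assumed grouping scheme: only invoices with the same amount are compared.
    return round(invoice["amount"], 2)

def candidate_pairs(records):
    # Bucket records by group key, then pair records only within each bucket.
    groups = defaultdict(list)
    for record in records:
        groups[group_key(record)].append(record)
    for members in groups.values():
        yield from combinations(members, 2)

def similarity(a, b):
    # Case-insensitive similarity in [0, 1]; a stand-in for the paper's metrics.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_features(a, b):
    # One similarity score per compared field forms the pair's feature vector.
    return [
        similarity(a["supplier"], b["supplier"]),
        similarity(a["reference"], b["reference"]),
    ]

pairs = list(candidate_pairs(invoices))
features = {(a["id"], b["id"]): pair_features(a, b) for a, b in pairs}

all_pairs = len(invoices) * (len(invoices) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
for ids, vector in features.items():
    print(ids, [round(v, 2) for v in vector])
```

In a full pipeline, feature vectors like these, together with match / non-match labels, would be the training input for a classifier such as the neural network or boosted decision tree mentioned in the abstract.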
Taxonomy
Topics: Data Quality and Management · Data Mining Algorithms and Applications · Privacy-Preserving Technologies in Data
