Linking Datasets on Organizations Using Half A Billion Open-Collaborated Records
Brian Libgober, Connor T. Jerzak

TL;DR
This paper leverages a massive LinkedIn dataset to improve organizational name matching, addressing limitations of traditional fuzzy and machine learning methods by incorporating trillions of name pairs and network modeling.
Contribution
It introduces a new large-scale LinkedIn-based training corpus and network modeling approach to significantly enhance organizational name matching accuracy.
Findings
Improved matching performance using LinkedIn data
Enhanced calibration of matching probabilities
Open-source data and methods for reproducibility
Abstract
Scholars studying organizations often work with multiple datasets lacking shared identifiers or covariates. In such situations, researchers usually use approximate string ("fuzzy") matching methods to combine datasets. String matching, although useful, faces fundamental challenges. Even where two strings appear similar to humans, fuzzy matching often struggles because it fails to adapt to the informativeness of the character combinations. In response, a number of machine learning methods have been developed to refine string matching. Yet, the effectiveness of these methods is limited by the size and diversity of training data. This paper introduces data from a prominent employment networking site (LinkedIn) as a massive training corpus to address these limitations. By leveraging information from the LinkedIn corpus regarding organizational name-to-name links, we incorporate trillions of…
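To make the limitation of fuzzy matching described in the abstract concrete, here is a minimal sketch using Python's standard-library `difflib`. It is not the authors' method. The helper `fuzzy_score` and the example organization names are illustrative assumptions and are not drawn from the paper or its data. The sketch shows how a purely character-based similarity can rate two different organizations as closer than an organization and its well-known alias.

```python
from difflib import SequenceMatcher


def fuzzy_score(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (hypothetical helper, not from the paper).

    Uses difflib's SequenceMatcher ratio on lowercased strings. Every
    character is weighted equally, regardless of how informative it is.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Illustrative name pairs (assumptions chosen for the example).
different_orgs = ("Bank of America", "Bank of Africa")          # distinct organizations
same_org_alias = ("IBM", "International Business Machines")     # same organization

if __name__ == "__main__":
    print("different orgs:", round(fuzzy_score(*different_orgs), 3))
    print("same org alias:", round(fuzzy_score(*same_org_alias), 3))
```

Because the score depends only on shared characters, the pair of distinct organizations scores higher than the alias pair. A model trained on observed name-to-name links, as the abstract proposes, is one way to learn which character patterns actually signal a shared identity.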
Taxonomy
Topics: Data Quality and Management
