Linking Datasets on Organizations Using Half A Billion Open-Collaborated Records

Brian Libgober; Connor T. Jerzak

arXiv:2302.02533 · cs.SI · September 24, 2025 · 1 citation

TL;DR

This paper uses a massive LinkedIn dataset as a training corpus for organizational name matching. It addresses the limits of traditional fuzzy matching and machine learning methods by drawing on trillions of name pairs and on network modeling.

Contribution

The paper introduces a large-scale LinkedIn-based training corpus and a network modeling approach that together improve the accuracy of organizational name matching.

Findings

01 Improved matching performance using LinkedIn data

02 Enhanced calibration of matching probabilities

03 Open source data and methods for reproducibility

Abstract

Scholars studying organizations often work with multiple datasets lacking shared identifiers or covariates. In such situations, researchers usually use approximate string ("fuzzy") matching methods to combine datasets. String matching, although useful, faces fundamental challenges. Even where two strings appear similar to humans, fuzzy matching often struggles because it fails to adapt to the informativeness of the character combinations. In response, a number of machine learning methods have been developed to refine string matching. Yet, the effectiveness of these methods is limited by the size and diversity of training data. This paper introduces data from a prominent employment networking site (LinkedIn) as a massive training corpus to address these limitations. By leveraging information from the LinkedIn corpus regarding organizational name-to-name links, we incorporate trillions of…
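The limitation of fuzzy matching described in the abstract can be illustrated with a minimal sketch. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a generic character-level matcher; this is not the paper's method, and the organization names and the helper `fuzzy_score` are hypothetical examples chosen for illustration only.

```python
from difflib import SequenceMatcher


def fuzzy_score(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case.

    Illustrative helper (not from the paper): every character counts
    equally, whether it sits in an informative token or a generic one.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical name pairs, labeled by whether they refer to the same organization.
pairs = [
    ("Apple Inc", "Maple Inc", "different organizations"),
    ("International Business Machines", "IBM", "same organization"),
]

for left, right, label in pairs:
    print(f"{left!r} vs {right!r} ({label}): {fuzzy_score(left, right):.3f}")
```

In this sketch the two different organizations receive the higher score because they share most of their characters, including the generic suffix "Inc", while the alias pair shares almost none. A matcher trained on observed name-to-name links, as the abstract proposes, has information that a pure character comparison lacks.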

Code & Models

Repositories

cjerzak/LinkOrgs-software (tf, Official)

Taxonomy

Topics: Data Quality and Management