Large-scale diversity estimation through surname origin inference
Antoine Mazières, Camille Roth

TL;DR
This paper introduces a data-driven surname origin classifier for estimating the relative diversity of social groups at large scale, which is especially useful when direct data is scarce, and analyzes the representativeness of surname origins across 15 socio-professional groups in France.
Contribution
It develops a surname origin classifier based on a data-driven typology, uses it to estimate the relative diversity of social groups at large scale where such data is scarce, and applies the approach to 15 socio-professional groups in France.
Findings
The classifier effectively estimates surname origins across large datasets.
Surname origins show significant variation among different socio-professional groups.
The methodology provides a new tool for social and demographic research.
Abstract
The study of surnames as both linguistic and geographical markers of the past has proven valuable in several research fields spanning from biology and genetics to demography and social mobility. This article builds upon the existing literature to conceive and develop a surname origin classifier based on a data-driven typology. This enables us to explore a methodology to describe large-scale estimates of the relative diversity of social groups, especially when such data is scarcely available. We subsequently analyze the representativeness of surname origins for 15 socio-professional groups in France.
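As a rough illustration of what "representativeness of surname origins" could mean in practice, the sketch below compares the share of each inferred origin within a group to its share in a reference population. This ratio is an assumption chosen for illustration; the abstract does not specify the measure the authors use. The origin labels ("A", "B", "C") and the toy data are hypothetical, and the labels are assumed to come from some upstream surname origin classifier.

```python
from collections import Counter


def origin_shares(origins):
    """Return the fraction of each origin label in a list of labels."""
    counts = Counter(origins)
    total = sum(counts.values())
    return {origin: n / total for origin, n in counts.items()}


def representativeness(group_origins, reference_origins):
    """Ratio of each origin's share in the group to its share in the reference.

    A value above 1 suggests over-representation in the group, below 1
    under-representation. Illustrative measure only, not the paper's own.
    """
    group = origin_shares(group_origins)
    reference = origin_shares(reference_origins)
    return {
        origin: group.get(origin, 0.0) / share
        for origin, share in reference.items()
    }


# Hypothetical, made-up labels standing in for classifier output.
reference = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
group = ["A"] * 4 + ["B"] * 1
ratios = representativeness(group, reference)
```

With real data, each list would hold one inferred origin label per person, and the reference list would cover the whole population against which the socio-professional group is compared.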
