Nationality Classification Using Name Embeddings
Junting Ye, Shuchu Han, Yifan Hu, Baris Coskun, Meizhu Liu, Hong Qin, Steven Skiena

TL;DR
This paper introduces a highly accurate, fine-grained nationality classifier based on name embeddings learned from communication patterns, outperforming previous methods and revealing social insights through social media analysis.
Contribution
We develop a novel name embedding approach leveraging communication homophily, enabling a large-scale, fine-grained nationality classifier with superior accuracy.
Findings
Achieved an F1 score of 0.795 on 13 classes, outperforming prior systems.
Built a classifier covering 39 nationality groups that represent over 90% of the world population.
Revealed demographic and ethnic patterns in social media followers.
Abstract
Nationality identification unlocks important demographic information, with many applications in biomedical and sociological research. Existing name-based nationality classifiers use name substrings as features and are trained on small, unrepresentative sets of labeled names, typically extracted from Wikipedia. As a result, these methods achieve limited performance and cannot support fine-grained classification. We exploit the phenomenon of homophily in communication patterns to learn name embeddings, a new representation that encodes gender, ethnicity, and nationality, and which is readily applicable to building classifiers and other systems. Through our analysis of 57M contact lists from a major Internet company, we are able to design a fine-grained nationality classifier covering 39 groups representing over 90% of the world population. In an evaluation against other published systems over…
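To make the homophily idea concrete, here is a minimal sketch in Python of how names that share contact lists end up with similar vectors. It is not the authors' method. Raw co-occurrence counts stand in for learned embeddings, and a nearest-labeled-neighbor rule stands in for the trained classifier. The contact lists, the seed labels, and the helper names `cooccurrence_vectors`, `cosine`, and `classify` are all invented for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt

# Toy contact lists (invented for illustration; not data from the paper).
CONTACT_LISTS = [
    ["wei", "li", "chen", "ming"],
    ["li", "chen", "ming", "yan"],
    ["wei", "ming", "yan"],
    ["giovanni", "luca", "marco", "paolo"],
    ["luca", "marco", "paolo", "sofia"],
    ["giovanni", "paolo", "sofia"],
]

# A few hand-labeled seed names (hypothetical labels).
LABELED = {"li": "Chinese", "luca": "Italian"}


def cooccurrence_vectors(contact_lists):
    """Map each name to a sparse vector counting the names it shares lists with."""
    vectors = defaultdict(Counter)
    for contacts in contact_lists:
        # Count each unordered pair of distinct names once per contact list.
        for a, b in combinations(sorted(set(contacts)), 2):
            vectors[a][b] += 1
            vectors[b][a] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(name, vectors, labeled):
    """Return the label of the labeled name whose vector is most similar."""
    nearest = max(labeled, key=lambda seed: cosine(vectors[name], vectors[seed]))
    return labeled[nearest]


vectors = cooccurrence_vectors(CONTACT_LISTS)
for query in ("yan", "sofia"):
    print(query, "->", classify(query, vectors, LABELED))
```

In this toy data the two groups of names never appear in the same contact list, so their vectors share no dimensions and the nearest labeled neighbor separates them cleanly. Real contact lists are far noisier, which is one reason a learned dense embedding and a trained classifier would be preferable to raw counts.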
Taxonomy
Topics: Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection · Misinformation and Its Impacts
