NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data
Cong Ming, Ruixin Shi, Yifan Hu

TL;DR
NameBERT uses large language models to augment training data for name-based nationality classification, outperforming existing baselines while remaining efficient enough for large-scale inference.
Contribution
The paper introduces a framework that uses LLMs to enrich name datasets rather than to run inference, improving nationality prediction especially for underrepresented countries.
Findings
Augmentation with LLM-generated names improves tail-country classification accuracy.
NameBERT outperforms state-of-the-art baselines on both in-domain and out-of-domain tasks.
The approach maintains efficiency suitable for large-scale inference.
Abstract
Inferring nationality from personal names is a critical capability for equity and bias monitoring and for personalization, and a valuable tool in biomedical and sociological research. However, existing name-based nationality classifiers are typically trained on relatively small or source-specific labeled datasets, which can introduce coverage gaps and limit performance for underrepresented countries. While large language models (LLMs) demonstrate strong zero-shot performance for name-based nationality prediction, their computational cost and latency make them impractical for real-time, large-scale deployment. In this work, we create a large-scale name-nationality dataset from the Open Academic Graph (OAG) and introduce a framework that leverages LLMs as dataset enrichers rather than inference engines. We augment low-resource countries with LLM-generated names and evaluate on real and…
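The enrichment step described in the abstract (topping up low-resource countries with generated names) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `augment_tail_countries`, the `min_per_country` threshold, the placeholder country codes, and the stub generator standing in for an LLM call are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable

def augment_tail_countries(
    records: list[tuple[str, str]],
    generate_names: Callable[[str, int], list[str]],
    min_per_country: int,
) -> list[tuple[str, str]]:
    """Top up each country that has fewer than min_per_country names.

    records holds (name, country) pairs; generate_names(country, k) is a
    hypothetical stand-in for an LLM call that returns up to k names.
    """
    counts = Counter(country for _, country in records)
    augmented = list(records)
    for country, n in counts.items():
        shortfall = min_per_country - n
        if shortfall <= 0:
            continue  # well-represented country: leave untouched
        existing = {name for name, c in records if c == country}
        # Skip generated names that duplicate real ones for this country.
        fresh = [nm for nm in generate_names(country, shortfall) if nm not in existing]
        augmented.extend((nm, country) for nm in fresh)
    return augmented

# Stub generator (assumption): a real pipeline would query an LLM here.
def stub_generator(country: str, k: int) -> list[str]:
    return [f"{country}-synthetic-{i}" for i in range(k)]

# Toy data with placeholder names and country codes.
records = [("name_a", "AA"), ("name_b", "AA"), ("name_c", "BB")]
augmented = augment_tail_countries(records, stub_generator, min_per_country=2)
```

In this toy run only the underrepresented country "BB" receives a synthetic name; the real and generated records would then be used together to train a compact classifier, so the LLM is never needed at inference time.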
