NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data

Cong Ming; Ruixin Shi; Yifan Hu

arXiv:2604.10401·cs.CL·April 22, 2026

TL;DR

NameBERT uses large language models to augment training data for name-based nationality classification, improving accuracy over existing methods while remaining efficient enough for large-scale inference.

Contribution

The paper introduces a framework that uses LLMs as dataset enrichers rather than inference engines, improving nationality prediction, especially for underrepresented countries.

Findings

1. Augmentation with LLM-generated names improves tail-country classification accuracy.

2. NameBERT outperforms state-of-the-art baselines in both in- and out-of-domain tasks.

3. The approach maintains efficiency suitable for large-scale inference.

Abstract

Inferring nationality from personal names is a critical capability for equity and bias monitoring and for personalization, and a valuable tool in biomedical and sociological research. However, existing name-based nationality classifiers are typically trained on relatively small or source-specific labeled datasets, which can introduce coverage gaps and limit performance for underrepresented countries. While large language models (LLMs) demonstrate strong zero-shot performance for name-based nationality prediction, their computational cost and latency make them impractical for real-time, large-scale deployment. In this work, we create a large-scale name-nationality dataset from the Open Academic Graph (OAG) and introduce a framework that leverages LLMs as dataset enrichers rather than inference engines. We augment low-resource countries with LLM-generated names and evaluate on real and…
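The idea of using an LLM as a dataset enricher for low-resource countries can be illustrated with a minimal sketch. This is not the authors' pipeline: the function `augment_tail_countries`, the `min_per_country` threshold, the `generate_names` callable (a stand-in for an LLM call), and the toy data are all assumptions made for illustration only.

```python
from collections import Counter
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (name, country label)


def augment_tail_countries(
    data: List[Example],
    generate_names: Callable[[str, int], List[str]],
    min_per_country: int,
) -> List[Example]:
    """Top up countries that have fewer than `min_per_country` examples.

    `generate_names(country, n)` is a hypothetical stand-in for an LLM
    that proposes up to `n` plausible names for `country`.
    """
    counts = Counter(country for _, country in data)
    augmented = list(data)  # real examples are kept unchanged
    seen = set(data)        # avoid adding exact duplicates

    for country, count in sorted(counts.items()):
        deficit = min_per_country - count
        if deficit <= 0:
            continue  # well-represented country, nothing to add
        for name in generate_names(country, deficit):
            if (name, country) not in seen:
                seen.add((name, country))
                augmented.append((name, country))
    return augmented


def fake_generator(country: str, n: int) -> List[str]:
    # Placeholder generator; a real setup would query an LLM here.
    return [f"{country}-synthetic-{i}" for i in range(n)]


# Toy data, invented for this example.
real = [
    ("Wei Zhang", "CN"),
    ("Li Na", "CN"),
    ("Ming Chen", "CN"),
    ("Aroha Ngata", "NZ"),
]
result = augment_tail_countries(real, fake_generator, min_per_country=3)
```

In this sketch only the underrepresented label ("NZ") receives synthetic names, and the real examples stay at the front of the list, which keeps it easy to hold out real data for evaluation.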
