Large Language Models Naively Recover Ethnicity from Individual Records
Noah Dasanaike

TL;DR
Large language models can infer ethnicity from names more accurately than traditional methods such as BISG, including in countries outside the United States, and fine-tuned small models make the approach cheap to deploy at scale.
Contribution
This paper demonstrates that large language models, used without additional training data, recover ethnicity from names more accurately than existing methods, which extends demographic inference to settings and classification categories that those methods do not cover.
Findings
LLM-based classification achieves up to 84.7% accuracy on stratified samples from Florida and North Carolina voter files with self-reported race.
BISG reaches 68.2% accuracy on balanced samples, well below the LLM result.
Fine-tuned small transformer models can surpass BISG accuracy at lower cost.
Abstract
I demonstrate that large language models can infer ethnicity from names with accuracy exceeding that of Bayesian Improved Surname Geocoding (BISG) without additional training data, enabling inference outside the United States and to contextually appropriate classification categories. Using stratified samples from Florida and North Carolina voter files with self-reported race, LLM-based classification achieves up to 84.7% accuracy, outperforming BISG (68.2%) on balanced samples. I test six models including Gemini 3 Flash, GPT-4o, and open-source alternatives such as DeepSeek v3.2 and GLM-4.7. Enabling extended reasoning can improve accuracy by 1-3 percentage points, though effects vary across contexts; including metadata such as party registration reaches 86.7%. LLM classification also reduces the income bias inherent in BISG, where minorities in wealthier neighborhoods are…
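For readers unfamiliar with the baseline, the following is a minimal sketch of the Bayes update that BISG-style methods rely on: a surname-based prior over race is combined with a geography likelihood, assuming surname and location are conditionally independent given race. The function name `bisg_posterior`, the category labels, and all probabilities below are hypothetical illustrations, not values or code from the paper.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and geography evidence with Bayes' rule.

    posterior(r) is proportional to P(r | surname) * P(geo | r),
    which assumes surname and geography are independent given race.
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("all categories have zero probability")
    return {race: value / total for race, value in unnormalized.items()}


# Hypothetical inputs for one voter record (illustrative numbers only).
p_race_given_surname = {"white": 0.60, "black": 0.30, "hispanic": 0.05, "asian": 0.05}
p_geo_given_race = {"white": 0.001, "black": 0.004, "hispanic": 0.002, "asian": 0.001}

posterior = bisg_posterior(p_race_given_surname, p_geo_given_race)
prediction = max(posterior, key=posterior.get)  # most probable category
```

In this toy case the geography term outweighs the surname prior, which shows how heavily such a method can lean on neighborhood composition.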
Taxonomy
Topics: Authorship Attribution and Profiling · Names, Identity, and Discrimination Research · Data Quality and Management
