Predicting Race and Ethnicity From the Sequence of Characters in a Name
Rajashekar Chintalapati, Suriyan Laohaprapanon, and Gaurav Sood

TL;DR
This paper develops models to predict race and ethnicity from full names, improving accuracy over existing methods by using character sequences and applying them to real-world data analysis.
Contribution
It introduces models that predict race and ethnicity from the sequence of characters in a name, including an LSTM. They aim to generalize better than a lookup in the Census Bureau's last-name list and to give higher accuracy when first names are available.
Findings
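The best model, an LSTM, achieves out-of-sample accuracy of 0.85 when first names are available.
The best-performing last-name model achieves out-of-sample accuracy of 0.81.
The models are applied to campaign finance data, to estimate the share of donations by racial group, and to news data.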
Abstract
To answer questions about racial inequality and fairness, we often need a way to infer race and ethnicity from names. One way to infer race and ethnicity from names is by relying on the Census Bureau's list of popular last names. The list, however, suffers from at least three limitations: 1. it only contains last names, 2. it only includes popular last names, and 3. it is updated once every 10 years. To provide better generalization, and higher accuracy when first names are available, we model the relationship between characters in a name and race and ethnicity using various techniques. A model using Long Short-Term Memory works best with out-of-sample accuracy of .85. The best-performing last-name model achieves out-of-sample accuracy of .81. To illustrate the utility of the models, we apply them to campaign finance data to estimate the share of donations made by people of various…
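To make "modeling the relationship between characters in a name and race and ethnicity" concrete, here is a minimal sketch of how a name might be turned into a fixed-length sequence of integer tokens that a sequence model such as an LSTM could consume. The choice of character bigrams, the padding length of 20, the helper names (`char_bigrams`, `build_vocab`, `encode`), and the example names are all assumptions made for illustration; the abstract does not specify the encoding.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved ids: padding and unseen bigrams (assumed convention)

def char_bigrams(name):
    """Split a name into overlapping two-character tokens."""
    s = name.lower().strip()
    return [s[i:i + 2] for i in range(len(s) - 1)]

def build_vocab(names, min_count=1):
    """Map each bigram seen at least min_count times to an integer id."""
    counts = Counter(bg for n in names for bg in char_bigrams(n))
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for bg, c in sorted(counts.items()):
        if c >= min_count:
            vocab[bg] = len(vocab)
    return vocab

def encode(name, vocab, max_len=20):
    """Convert a name to a list of ids, truncated or padded to max_len."""
    ids = [vocab.get(bg, UNK) for bg in char_bigrams(name)][:max_len]
    return ids + [PAD] * (max_len - len(ids))

# Made-up names, for illustration only
names = ["smith john", "garcia maria", "nguyen anh"]
vocab = build_vocab(names)
encoded = [encode(n, vocab) for n in names]
```

Because the tokens are sub-word units, a name that is absent from any lookup list still yields a usable input, which is the generalization advantage the abstract points to. Bigrams never seen in training fall back to the unknown id.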
Taxonomy
Topics: Names, Identity, and Discrimination Research · Authorship Attribution and Profiling
