raceBERT -- A Transformer-based Model for Predicting Race and Ethnicity from Names
Prasanna Parasurama

TL;DR
raceBERT is a transformer-based model that predicts race and ethnicity from names. It reaches an average F1-score of 0.86, a 4.1% improvement over the previous state of the art.
Contribution
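The paper develops raceBERT, a transformer-based model for predicting race and ethnicity from names. It replaces the LSTM of Sood and Laohaprapanon (2018) with a pre-trained BERT model and a RoBERTa model trained from scratch, compares the results, and ships the model as an open-source Python package.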
Findings
Achieves an average F1-score of 0.86, a 4.1% improvement over the previous state of the art.
Improves results for non-white names by 15-17% over the previous state of the art.
Demonstrates the effectiveness of transformer models over LSTM for this task.
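To make the "average F1-score" above concrete, here is a minimal sketch of one common way to average F1 across classes (macro-averaging). This summary does not say which averaging scheme the paper uses, so the macro choice, the function names, and the toy labels are all assumptions for illustration only.

```python
def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1} computed one-vs-rest for each label."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# Toy data with hypothetical class names "a" and "b"; not taken from the paper.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
toy_scores = per_class_f1(y_true, y_pred, ["a", "b"])
toy_macro = macro_f1(y_true, y_pred, ["a", "b"])
```

Macro-averaging gives each class equal weight regardless of how many names it contains, which is why per-class gains on smaller groups can move the average noticeably.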
Abstract
This paper presents raceBERT -- a transformer-based model for predicting race and ethnicity from character sequences in names, and an accompanying Python package. Using a transformer-based model trained on a U.S. Florida voter registration dataset, the model predicts the likelihood of a name belonging to 5 U.S. census race categories (White, Black, Hispanic, Asian & Pacific Islander, American Indian & Alaskan Native). I build on Sood and Laohaprapanon (2018) by replacing their LSTM model with transformer-based models (a pre-trained BERT model, and a RoBERTa model trained from scratch), and compare the results. To the best of my knowledge, raceBERT achieves state-of-the-art results in race prediction using names, with an average F1-score of 0.86 -- a 4.1% improvement over the previous state-of-the-art, and improvements of 15-17% for non-white names.
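The abstract describes the output as a likelihood for each of the 5 categories. A minimal sketch of that last step, assuming a classifier head that emits one raw score (logit) per category and a softmax to normalize them; the logit values and helper names below are made up for illustration and are not the raceBERT package API.

```python
import math

# Category names as listed in the abstract.
CATEGORIES = [
    "White",
    "Black",
    "Hispanic",
    "Asian & Pacific Islander",
    "American Indian & Alaskan Native",
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_probabilities(logits):
    """Map one logit per category to a {category: probability} dict."""
    if len(logits) != len(CATEGORIES):
        raise ValueError("expected one logit per category")
    return dict(zip(CATEGORIES, softmax(logits)))


# Hypothetical logits for a single name; not real model output.
probs = to_probabilities([2.0, 1.0, 0.5, 0.1, -1.0])
top_category = max(probs, key=probs.get)
```

The probabilities sum to 1, so the predicted category is simply the one with the highest value.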
Taxonomy
Topics: Names, Identity, and Discrimination Research · Authorship Attribution and Profiling · Forensic and Genetic Research
Methods: Attention Is All You Need · Linear Layer · Tanh Activation · Sigmoid Activation · Attention Dropout · Linear Warmup With Linear Decay · Softmax · Weight Decay · WordPiece
