Using Embedding Models to Improve Probabilistic Race Prediction
Noah Dasanaike, Kosuke Imai

TL;DR
This paper introduces embedding-powered BISG (eBISG), a novel method using text embeddings and neural networks to improve race prediction accuracy for individuals with uncommon surnames not covered by Census data.
Contribution
The paper develops eBISG, a new approach that leverages pre-trained embeddings and neural networks to enhance race prediction for underrepresented surnames.
Findings
eBISG improves race prediction accuracy over standard BISG.
Full-name embedding yields the largest gains, especially for Hispanic and Asian voters.
Embedding approaches outperform traditional methods for names not in Census data.
Abstract
Estimating racial disparity requires individual-level race data, which are often unavailable due to the sensitivity of collecting such information. To address this problem, many researchers use Bayesian Improved Surname Geocoding (BISG), which relies critically on Census surname data. Unfortunately, these data capture race-surname relationships only for common surnames, omitting approximately 10% of the US population. We show that predictive performance degrades substantially for individuals with such omitted, uncommon surnames because the standard BISG implementation relies on an uninformative generic prior in these cases. To address this limitation, we propose embedding-powered BISG (eBISG), which uses pre-trained text embeddings to represent names as dense vectors and trains neural networks on 2020 Census surname and first-name data to estimate race probabilities for names not…
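
To make the role of the generic prior concrete, here is a minimal Python sketch of the standard BISG update, P(race | surname, geography) ∝ P(race | surname) × P(geography | race), with a fallback for surnames missing from the surname table. It is not the authors' implementation: the function name, the table layout, and every probability below are hypothetical values chosen for illustration, not Census figures.

```python
def bisg_posterior(surname, geo, surname_table, geo_table, generic_prior):
    """Return P(race | surname, geo) under the usual BISG independence assumption."""
    # Unlisted surnames fall back to the same generic prior, so the surname
    # carries no information for them (the limitation the abstract describes).
    prior = surname_table.get(surname.lower(), generic_prior)
    # Bayes' rule: multiply the surname prior by P(geo | race), then normalize.
    unnormalized = {race: prior[race] * geo_table[race][geo] for race in prior}
    total = sum(unnormalized.values())
    return {race: value / total for race, value in unnormalized.items()}


# Hypothetical P(race | surname) table covering only common surnames.
SURNAME_TABLE = {
    "garcia": {"white": 0.05, "black": 0.01, "hispanic": 0.92, "asian": 0.02},
}

# Hypothetical P(geo | race): share of each group living in each tract.
GEO_TABLE = {
    "white": {"tract_a": 0.002},
    "black": {"tract_a": 0.001},
    "hispanic": {"tract_a": 0.003},
    "asian": {"tract_a": 0.001},
}

# Hypothetical generic prior used when a surname is not in the table.
GENERIC_PRIOR = {"white": 0.60, "black": 0.12, "hispanic": 0.19, "asian": 0.09}

listed = bisg_posterior("Garcia", "tract_a", SURNAME_TABLE, GEO_TABLE, GENERIC_PRIOR)
unlisted = bisg_posterior("Unlisted", "tract_a", SURNAME_TABLE, GEO_TABLE, GENERIC_PRIOR)
print(listed)
print(unlisted)
```

In this sketch, every unlisted surname in the same tract receives an identical posterior, because only geography distinguishes such individuals. As the summary above describes it, eBISG targets exactly this fallback step by estimating race probabilities from name embeddings for names the Census tables do not cover.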
