Race and ethnicity data for first, middle, and last names
Evan T. R. Rosenman, Santiago Olivella, and Kosuke Imai

TL;DR
This paper introduces the largest publicly available dictionaries of names linked to racial and ethnic data, enabling improved race/ethnicity imputation for datasets lacking such information.
Contribution
It provides extensive, publicly accessible name dictionaries with racial/ethnic counts derived from voter files, enhancing race imputation methods like BISG.
Findings
Dictionaries contain roughly 1 million first names, 1.1 million middle names, and 1.4 million surnames.
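Individuals are categorized into five mutually exclusive racial/ethnic groups (White, Black, Hispanic, Asian, and Other), and counts per group are given for every name.
Counts can be normalized row-wise or column-wise to obtain conditional probabilities of race given name or name given race, for use in imputation.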
Abstract
We provide the largest compiled publicly available dictionaries of first, middle, and last names for the purpose of imputing race and ethnicity using, for example, Bayesian Improved Surname Geocoding (BISG). The dictionaries are based on the voter files of six Southern states that collect self-reported racial data upon voter registration. Our data cover a much larger scope of names than any comparable dataset, containing roughly one million first names, 1.1 million middle names, and 1.4 million surnames. Individuals are categorized into five mutually exclusive racial and ethnic groups -- White, Black, Hispanic, Asian, and Other -- and racial/ethnic counts by name are provided for every name in each dictionary. Counts can then be normalized row-wise or column-wise to obtain conditional probabilities of race given name or name given race. These conditional probabilities can then be…
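The two normalizations mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names and counts are invented, the lowercase race labels and the helper names `race_given_name` and `name_given_race` are hypothetical, and the nested-dictionary layout is not the format of the published dictionaries. Row-wise normalization divides each name's counts by that name's total, giving P(race | name); column-wise normalization divides by each race's total across all names, giving P(name | race).

```python
RACES = ("white", "black", "hispanic", "asian", "other")

# Hypothetical counts, invented for illustration only (not taken from the dictionaries).
counts = {
    "SMITH": {"white": 70, "black": 25, "hispanic": 2, "asian": 1, "other": 2},
    "GARCIA": {"white": 5, "black": 1, "hispanic": 90, "asian": 2, "other": 2},
}


def race_given_name(counts):
    """Row-wise normalization: P(race | name), each name's row sums to 1."""
    probs = {}
    for name, row in counts.items():
        total = sum(row[r] for r in RACES)
        # A name with no observations gets zeros rather than a division error.
        probs[name] = {r: (row[r] / total if total else 0.0) for r in RACES}
    return probs


def name_given_race(counts):
    """Column-wise normalization: P(name | race), each race's column sums to 1."""
    totals = {r: sum(row[r] for row in counts.values()) for r in RACES}
    return {
        name: {r: (row[r] / totals[r] if totals[r] else 0.0) for r in RACES}
        for name, row in counts.items()
    }


p_race = race_given_name(counts)
p_name = name_given_race(counts)
print(p_race["SMITH"]["white"])      # share of SMITH rows recorded as white
print(p_name["GARCIA"]["hispanic"])  # share of hispanic rows named GARCIA
```

Which normalization is appropriate depends on how the imputation model conditions on the name, so the choice is left to the user of the counts.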
Taxonomy
Topics: Names, Identity, and Discrimination Research
