Enriching Datasets with Demographics through Large Language Models: What's in a Name?
Khaled AlNuaimi, Gautier Marti, Mathieu Ravaut, Abdulla AlKetbi, Andreas Henschel, Raed Jaradat

TL;DR
This paper explores using large language models' zero-shot abilities to infer demographic information from names, addressing data scarcity and bias issues in traditional methods across diverse datasets.
Contribution
It demonstrates that LLMs can predict demographics from names zero-shot, without task-specific training, matching or outperforming bespoke models trained on specialized data, and it analyzes the biases inherent in those predictions.
Findings
LLMs perform comparably or better than traditional models in demographic prediction.
Application to real-world unlabelled datasets shows practical utility.
Analysis uncovers demographic biases in LLM predictions.
Abstract
Enriching datasets with demographic information, such as gender, race, and age from names, is a critical task in fields like healthcare, public policy, and social sciences. Such demographic insights allow for more precise and effective engagement with target populations. Despite previous efforts employing hidden Markov models and recurrent neural networks to predict demographics from names, significant limitations persist: the lack of large-scale, well-curated, unbiased, publicly available datasets, and the lack of an approach robust across datasets. This scarcity has hindered the development of traditional supervised learning approaches. In this paper, we demonstrate that the zero-shot capabilities of Large Language Models (LLMs) can perform as well as, if not better than, bespoke models trained on specialized data. We apply these LLMs to a variety of datasets, including a real-life,…
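To make the zero-shot setup concrete, here is a minimal sketch of how a name could be sent to an LLM with instructions only and no labelled examples. It is not the authors' prompt or pipeline: the prompt wording, the label set, the function names (`build_prompt`, `parse_response`, `infer_gender`), and the `llm` callable are all assumptions made for illustration. The stub model always answers "unknown", so no real prediction is shown.

```python
import json
from typing import Callable, Dict

# Hypothetical label set; the paper's actual categories may differ.
ALLOWED_GENDERS = {"female", "male", "unknown"}


def build_prompt(name: str) -> str:
    """Build a zero-shot prompt: instructions only, no labelled examples."""
    return (
        "Given the full name below, state the most likely gender.\n"
        'Reply with a JSON object that has a single key "gender" whose value '
        "is one of: female, male, unknown.\n"
        f"Name: {name}"
    )


def parse_response(text: str) -> Dict[str, str]:
    """Parse the model reply, falling back to 'unknown' on anything unexpected."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return {"gender": "unknown"}
    if not isinstance(data, dict):
        return {"gender": "unknown"}
    gender = str(data.get("gender", "unknown")).strip().lower()
    if gender not in ALLOWED_GENDERS:
        gender = "unknown"
    return {"gender": gender}


def infer_gender(name: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Run one zero-shot query; `llm` is any function mapping a prompt to text."""
    return parse_response(llm(build_prompt(name)))


def stub_llm(prompt: str) -> str:
    """Placeholder standing in for a real model call (assumption, not a real API)."""
    return '{"gender": "unknown"}'


if __name__ == "__main__":
    print(infer_gender("Jane Doe", stub_llm))
```

Passing the model in as a plain callable keeps the sketch independent of any particular provider; swapping `stub_llm` for a real client wrapper would be the only change needed to try it against an actual model.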
Taxonomy
Topics: Topic Modeling · demographic modeling and climate adaptation · Computational and Text Analysis Methods
