SocioProbe: What, When, and Where Language Models Learn about Sociodemographics
Anne Lauscher, Federico Bianchi, Samuel Bowman, and Dirk Hovy

TL;DR
This paper investigates whether pre-trained language models encode sociodemographic information like gender and age, revealing that such knowledge is present but varies across layers and requires extensive pre-training data.
Contribution
It applies probing methods to analyze sociodemographic knowledge in PLMs, including multilingual and training data effects, filling a gap in understanding higher-level language knowledge.
Findings
PLMs encode sociodemographic information.
Knowledge is distributed across layers in some models.
More pre-training data enhances sociodemographic knowledge.
Abstract
Pre-trained language models (PLMs) have outperformed other NLP models on a wide range of tasks. Opting for a more thorough understanding of their capabilities and inner workings, researchers have established the extent to which they capture lower-level knowledge like grammaticality, and mid-level semantic knowledge like factual understanding. However, there is still little understanding of their knowledge of higher-level aspects of language. In particular, despite the importance of sociodemographic aspects in shaping our language, the questions of whether, where, and how PLMs encode these aspects, e.g., gender or age, are still unexplored. We address this research gap by probing the sociodemographic knowledge of different single-GPU PLMs on multiple English data sets via traditional classifier probing and information-theoretic minimum description length probing. Our results show that…
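The two probing approaches named in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation. It assumes synthetic 2-D vectors as stand-ins for frozen PLM representations, a binary label as a stand-in for a sociodemographic attribute, a plain logistic-regression probe, and the online (prequential) coding variant of minimum description length probing; the abstract does not say which variant or probe architecture the paper uses. All function names, block fractions, and hyperparameters below are hypothetical.

```python
import math
import random

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic-regression probe."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(xs, ys, epochs=200, lr=0.5):
    """Fit the probe with full-batch gradient descent (illustrative settings)."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = predict_proba(w, b, x) - y
            for j in range(dim):
                gw[j] += err * x[j]
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def accuracy(w, b, xs, ys):
    """Classifier probing: fraction of held-out labels predicted correctly."""
    hits = sum((predict_proba(w, b, x) >= 0.5) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)

def online_codelength(xs, ys, fractions=(0.1, 0.2, 0.4, 0.8, 1.0)):
    """MDL probing via online coding: bits needed to transmit the labels.

    The first block is sent with a uniform code; every later block is sent
    with a probe trained on all data transmitted so far.
    """
    n = len(xs)
    cuts = [max(1, int(f * n)) for f in fractions]
    bits = cuts[0] * math.log2(2)  # uniform code over 2 classes
    for start, end in zip(cuts, cuts[1:]):
        w, b = train_logreg(xs[:start], ys[:start])
        for x, y in zip(xs[start:end], ys[start:end]):
            p = predict_proba(w, b, x)
            p_true = p if y == 1 else 1.0 - p
            bits += -math.log2(max(p_true, 1e-12))
    return bits

def make_data(n, informative, seed):
    """Synthetic stand-in for representations; NOT real PLM embeddings."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        shift = 1.5 * (2 * y - 1) if informative else 0.0
        xs.append([rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)])
        ys.append(y)
    return xs, ys

# Representations that encode the label vs. a control that does not.
xs_real, ys_real = make_data(400, informative=True, seed=0)
xs_ctrl, ys_ctrl = make_data(400, informative=False, seed=1)

w, b = train_logreg(xs_real[:300], ys_real[:300])
acc = accuracy(w, b, xs_real[300:], ys_real[300:])

uniform_bits = len(ys_real) * math.log2(2)
bits_real = online_codelength(xs_real, ys_real)
bits_ctrl = online_codelength(xs_ctrl, ys_ctrl)
compression = uniform_bits / bits_real  # > 1 means the labels are compressible
```

In this sketch, a shorter codelength (higher compression) indicates that the label is easier to extract from the representations, which complements plain probe accuracy by also reflecting how much data the probe needs to get there.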
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
