How far can bias go? Tracing bias from pretraining data to alignment
Marion Thaler, Abdullatif Köksal, Alina Leidinger, Anna Korhonen, Hinrich Schütze

TL;DR
This paper investigates how gender-occupation biases in pretraining data influence biases in large language models, highlighting the amplification of such biases and the effects of tuning and prompting methods.
Contribution
It analyzes where gender-occupation bias in LLMs originates, emphasizes the importance of addressing bias during pretraining, and evaluates the impact of instruction-tuning and hyperparameters on bias expression.
Findings
Biases in pretraining data are amplified in model outputs.
Instruction-tuning partially reduces representational bias.
Prompt types and hyperparameters have limited effect on bias expression.
Abstract
As LLMs are increasingly integrated into user-facing applications, addressing biases that perpetuate societal inequalities is crucial. While much work has gone into measuring or mitigating biases in these models, fewer studies have investigated their origins. Therefore, this study examines the correlation between gender-occupation bias in pre-training data and its manifestation in LLMs, focusing on the Dolma dataset and the OLMo model. Using zero-shot prompting and token co-occurrence analyses, we explore how biases in training data influence model outputs. Our findings reveal that biases present in pre-training data are amplified in model outputs. The study also examines the effects of prompt types, hyperparameters, and instruction-tuning on bias expression, finding that instruction-tuning partially alleviates representational bias while still maintaining overall stereotypical gender…
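The token co-occurrence analysis mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the gendered word lists, the window size, the regex tokenizer, and the function names `cooccurrence_counts` and `female_share` are all assumptions chosen for illustration. The idea is to count how often gendered tokens appear near each occupation term in a corpus and to report the female share of those mentions.

```python
import re
from collections import Counter

# Hypothetical word lists for illustration; not the lexicons used in the paper.
FEMALE = {"she", "her", "hers", "woman"}
MALE = {"he", "his", "him", "man"}


def cooccurrence_counts(texts, occupations, window=5):
    """Count gendered tokens within `window` tokens of each occupation term."""
    counts = {occ: Counter() for occ in occupations}
    for text in texts:
        # Simple lowercase alphabetic tokenizer (an assumption, not the paper's).
        tokens = re.findall(r"[a-z]+", text.lower())
        for i, tok in enumerate(tokens):
            if tok not in counts:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j == i:
                    continue
                if tokens[j] in FEMALE:
                    counts[tok]["female"] += 1
                elif tokens[j] in MALE:
                    counts[tok]["male"] += 1
    return counts


def female_share(counter):
    """Fraction of gendered co-occurrences that are female, or None if none seen."""
    total = counter["female"] + counter["male"]
    return counter["female"] / total if total else None


corpus = [
    "The nurse said she was tired.",
    "The engineer fixed his code.",
    "She is an engineer.",
]
counts = cooccurrence_counts(corpus, ["nurse", "engineer"])
shares = {occ: female_share(c) for occ, c in counts.items()}
```

Under these assumptions, computing the same share from model generations and comparing it with the share found in the pretraining data would give one rough indication of whether a bias is amplified.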
