A word association network methodology for evaluating implicit biases in LLMs compared to humans
Katherine Abramski, Giulio Rossetti, Massimo Stella

TL;DR
This paper introduces a novel word association network method to evaluate and compare implicit social biases in large language models and humans, providing insights into their alignment and potential risks.
Contribution
The paper presents a new methodology based on semantic priming in word association networks to assess implicit biases in LLMs and compare them with human biases.
Findings
Identifies convergences and divergences in biases between LLMs and humans.
Provides a scalable framework for bias evaluation across multiple models and human data.
Reveals potential social risks associated with LLM biases.
Abstract
As large language models (LLMs) become increasingly integrated into our lives, their inherent social biases remain a pressing concern. Detecting and evaluating these biases can be challenging because they are often implicit rather than explicit in nature, so developing evaluation methods that assess the implicit knowledge representations of LLMs is essential. We present a novel word association network methodology for evaluating implicit biases in LLMs based on simulating semantic priming within LLM-generated word association networks. Our prompt-based approach taps into the implicit relational structures encoded in LLMs, providing both quantitative and qualitative assessments of bias. Unlike most prompt-based evaluation methods, our method enables direct comparisons between various LLMs and humans, providing a valuable point of reference and offering new insights into the alignment of…
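To make the idea of simulated semantic priming in a word association network more concrete, here is a minimal Python sketch of spreading activation. It assumes one common spreading-activation formulation: at each step, every node keeps a fraction `retention` of its activation and passes the rest to its neighbours in proportion to edge weight, with an optional per-step `decay`. The abstract does not specify the authors' exact procedure, so the functions `build_network`, `spread_activation` and `priming_difference`, their parameters, and the toy word pairs are hypothetical illustrations, not the paper's implementation.

```python
from collections import defaultdict

def build_network(associations):
    """Undirected weighted graph from (cue, response) pairs; weight = pair count.
    Hypothetical helper: the paper's network construction may differ."""
    graph = defaultdict(lambda: defaultdict(float))
    for cue, response in associations:
        if cue == response:
            continue  # skip self-loops
        graph[cue][response] += 1.0
        graph[response][cue] += 1.0
    # Freeze into plain dicts so later lookups cannot create nodes by accident.
    return {node: dict(nbrs) for node, nbrs in graph.items()}

def spread_activation(graph, prime, steps=3, retention=0.5, decay=0.0):
    """Return node -> activation after `steps` rounds of spreading from `prime`.
    Assumed formulation (retention + weighted spreading + decay), for illustration."""
    if prime not in graph:
        raise ValueError(f"prime {prime!r} is not in the network")
    activation = {node: 0.0 for node in graph}
    activation[prime] = 1.0  # the prime word starts fully activated
    for _ in range(steps):
        updated = {node: 0.0 for node in graph}
        for node, act in activation.items():
            if act == 0.0:
                continue
            updated[node] += act * retention  # share kept by the node itself
            nbrs = graph[node]
            total = sum(nbrs.values())
            for nbr, weight in nbrs.items():
                # remaining share split among neighbours in proportion to edge weight
                updated[nbr] += act * (1.0 - retention) * weight / total
        # optional decay shrinks total activation by (1 - decay) per step
        activation = {node: a * (1.0 - decay) for node, a in updated.items()}
    return activation

def priming_difference(graph, target, prime_a, prime_b, **kwargs):
    """Activation reaching `target` from prime_a minus that from prime_b.
    A hypothetical comparison score, not the paper's bias metric."""
    return (spread_activation(graph, prime_a, **kwargs)[target]
            - spread_activation(graph, prime_b, **kwargs)[target])

# Hypothetical toy associations, for illustration only (not data from the paper).
pairs = [("doctor", "nurse"), ("doctor", "hospital"), ("nurse", "hospital"),
         ("nurse", "care"), ("teacher", "school"), ("teacher", "student"),
         ("school", "student")]
net = build_network(pairs)
print(priming_difference(net, "hospital", "doctor", "teacher"))
```

A positive difference means the first prime sends more activation to the target than the second. Comparing such differences across networks built from different LLMs and from human responses is one way to picture how priming-based comparisons between models and humans could work, under the assumptions above.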
