Identifying Features Associated with Bias Against 93 Stigmatized Groups in Language Models and Guardrail Model Safety Mitigation
Anna-Maria Gueorguieva, Aylin Caliskan

TL;DR
This study examines how social features of stigmas influence bias in language models and evaluates the effectiveness of guardrail models in reducing such bias, highlighting persistent challenges in bias mitigation.
Contribution
The study identifies social features linked to bias in LLM outputs and assesses how much guardrail models reduce bias against stigmatized groups.
Findings
Highly perilous stigmas lead to more biased outputs (60%).
Guardrail models reduce bias by approximately 10%.
Features influencing bias remain unchanged after mitigation.
Abstract
Large language models (LLMs) have been shown to exhibit social bias; however, bias towards non-protected stigmatized identities remains understudied. Furthermore, it is unknown which social features of stigmas are associated with bias in LLM outputs. The psychology literature has shown that stigmas share six social features: aesthetics, concealability, course, disruptiveness, origin, and peril. In this study, we investigate whether human and LLM ratings of the features of stigmas, along with prompt style and type of stigma, have an effect on bias towards stigmatized groups in LLM outputs. We measure bias against 93 stigmatized groups across three widely used LLMs (Granite 3.0-8B, Llama-3.1-8B, Mistral-7B) using SocialStigmaQA, a benchmark that includes 37 social scenarios about stigmatized identities; for example, deciding whether to recommend them for an internship. We find that…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Computational and Text Analysis Methods · Mental Health via Writing
