Identifying Features Associated with Bias Against 93 Stigmatized Groups in Language Models and Guardrail Model Safety Mitigation

Anna-Maria Gueorguieva; Aylin Caliskan

arXiv:2512.19238·cs.CL·December 23, 2025

Open Access

TL;DR

This study examines how social features of stigmas influence bias in language models and evaluates the effectiveness of guardrail models in reducing such bias, highlighting persistent challenges in bias mitigation.

Contribution

It identifies social features linked to bias in LLM outputs and assesses the impact of guardrail models on reducing bias against stigmatized groups.

Findings

01

Highly perilous stigmas lead to more biased outputs (60%).

02

Guardrail models reduce bias by approximately 10%.

03

Features influencing bias remain unchanged after mitigation.

Abstract

Large language models (LLMs) have been shown to exhibit social bias; however, bias towards non-protected stigmatized identities remains understudied. Furthermore, it is unknown which social features of stigmas are associated with bias in LLM outputs. The psychology literature has shown that stigmas share six social features: aesthetics, concealability, course, disruptiveness, origin, and peril. In this study, we investigate whether human and LLM ratings of the features of stigmas, along with prompt style and type of stigma, have an effect on bias towards stigmatized groups in LLM outputs. We measure bias against 93 stigmatized groups across three widely used LLMs (Granite 3.0-8B, Llama-3.1-8B, Mistral-7B) using SocialStigmaQA, a benchmark that includes 37 social scenarios about stigmatized identities, for example, deciding whether to recommend them for an internship. We find that…
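
As a rough illustration of the kind of measurement the abstract describes, the share of biased outputs per stigmatized group can be tallied from labeled model responses. The sketch below is not the authors' pipeline: the function name `biased_share`, the record layout, and the toy records are all assumptions made for illustration, and no real benchmark data is used.

```python
from collections import defaultdict


def biased_share(records):
    """Return {group: fraction of outputs flagged as biased}.

    `records` is an iterable of (group, is_biased) pairs. This layout is
    a hypothetical simplification, not the SocialStigmaQA format.
    """
    totals = defaultdict(int)
    biased = defaultdict(int)
    for group, is_biased in records:
        totals[group] += 1
        biased[group] += int(is_biased)
    return {group: biased[group] / totals[group] for group in totals}


# Made-up toy records, purely to exercise the function.
toy_records = [
    ("group_a", True),
    ("group_a", False),
    ("group_b", False),
    ("group_b", False),
]
shares = biased_share(toy_records)
```

Under these assumptions, the same tally could be run once per model, and again on guardrail-filtered outputs, to compare shares before and after mitigation.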

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Computational and Text Analysis Methods · Mental Health via Writing