Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation

Riccardo Cantini; Giada Cosenza; Alessio Orsino; Domenico Talia

arXiv:2407.08441 · cs.CL · February 14, 2025

TL;DR

This paper investigates the biases in large language models, demonstrating how adversarial prompts can reveal hidden biases, and emphasizes the need for improved mitigation techniques to ensure fairness and safety.

Contribution

The paper introduces jailbreak prompts specifically designed to assess the adversarial robustness of LLMs against bias elicitation, revealing vulnerabilities in current models.

Findings

01. LLMs can be manipulated to produce biased responses
02. Adversarial prompts effectively reveal hidden biases
03. Current alignment techniques are insufficient for bias mitigation

Abstract

Large Language Models (LLMs) have revolutionized artificial intelligence, demonstrating remarkable computational power and linguistic capabilities. However, these models are inherently prone to various biases stemming from their training data. These include selection, linguistic, and confirmation biases, along with common stereotypes related to gender, ethnicity, sexual orientation, religion, socioeconomic status, disability, and age. This study explores the presence of these biases within the responses given by the most recent LLMs, analyzing the impact on their fairness and reliability. We also investigate how known prompt engineering techniques can be exploited to effectively reveal hidden biases of LLMs, testing their adversarial robustness against jailbreak prompts specially crafted for bias elicitation. Extensive experiments are conducted using the most widespread LLMs at…

Code & Models

Repositories

SCAlabUnical/LLM-Bias-Jailbreak (Official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI