Breaking the Benchmark: Revealing LLM Bias via Minimal Contextual Augmentation
Kaveh Eskandari Miandoab, Mahammed Kamruzzaman, Arshia Gharooni, Gene Louis Kim, Vasanth Sarathy, Ninareh Mehrabi

TL;DR
This paper introduces a flexible augmentation framework to evaluate and reveal biases in Large Language Models, demonstrating their susceptibility to input perturbations and highlighting biases against less-studied communities.
Contribution
The authors propose a novel, general augmentation method for bias evaluation applicable across multiple benchmarks, revealing vulnerabilities in LLMs' fairness and safety.
Findings
LLMs are sensitive to input perturbations, which make stereotypical responses more likely.
Biases are more prevalent when target demographics are from less-studied communities.
The framework is applicable to various fairness evaluation benchmarks.
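The first finding can be made concrete as a difference in rates. The sketch below is only an illustration, not the authors' metric: it assumes each benchmark item records the model's chosen answer and the option that matches the stereotype, and it reports how much the share of stereotypical answers changes after augmentation. The names `Response`, `stereotype_rate`, and `sensitivity` are hypothetical and do not come from the paper or from BBQ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Response:
    """One model answer to one benchmark item (hypothetical record layout)."""
    question_id: str
    answer: str              # option the model chose
    stereotyped_answer: str  # option that aligns with the stereotype


def stereotype_rate(responses: Sequence[Response]) -> float:
    """Fraction of responses where the model picked the stereotyped option."""
    if not responses:
        return 0.0
    hits = sum(1 for r in responses if r.answer == r.stereotyped_answer)
    return hits / len(responses)


def sensitivity(original: Sequence[Response], augmented: Sequence[Response]) -> float:
    """Change in stereotype rate after augmentation (positive means more stereotyping)."""
    return stereotype_rate(augmented) - stereotype_rate(original)


if __name__ == "__main__":
    # Toy data made up for illustration; not results from the paper.
    original = [
        Response("q1", "A", "A"),
        Response("q2", "B", "A"),
        Response("q3", "C", "A"),
        Response("q4", "B", "A"),
    ]
    augmented = [
        Response("q1", "A", "A"),
        Response("q2", "A", "A"),
        Response("q3", "A", "A"),
        Response("q4", "B", "A"),
    ]
    print(sensitivity(original, augmented))
```

A positive value means the augmented inputs drew more stereotypical answers than the originals. A real evaluation would also need to pair items by question and account for answer-position effects, which this sketch ignores.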
Abstract
Large Language Models have been shown to demonstrate stereotypical biases in their representations and behavior due to the discriminative nature of the data that they have been trained on. Despite significant progress in the development of methods and models that refrain from using stereotypical information in their decision-making, recent work has shown that approaches used for bias alignment are brittle. In this work, we introduce a novel and general augmentation framework that involves three plug-and-play steps and is applicable to a number of fairness evaluation benchmarks. Through application of augmentation to a fairness evaluation dataset (Bias Benchmark for Question Answering (BBQ)), we find that Large Language Models (LLMs), including state-of-the-art open and closed weight models, are susceptible to perturbations to their inputs, showcasing a higher likelihood to behave…
