Silenced Biases: The Dark Side LLMs Learned to Refuse
Rom Himelstein, Amit LeVi, Brit Youngmann, Yaniv Nemcovsky, Avi Mendelson

TL;DR
This paper introduces the Silenced Bias Benchmark (SBB), which reveals hidden biases in safety-aligned large language models by reducing their refusal responses, exposing fairness issues that standard evaluations overlook.
Contribution
The paper presents a novel activation steering method and the SBB framework to uncover latent biases in LLMs, overcoming limitations of previous prompt-based approaches.
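The summary above does not describe how the steering is carried out, so the following is only a generic sketch of one common form of activation steering: estimate a "refusal direction" as a difference of mean activations, then remove that component from a hidden state. It is not the authors' method. The function names, the difference-of-means estimate, and the toy three-dimensional vectors are all assumptions made for illustration.

```python
import math

# Illustrative only: toy vectors stand in for real model activations.
# Nothing here is taken from the paper; all names are hypothetical.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def mean_vector(rows):
    """Element-wise mean over a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def refusal_direction(refused_acts, answered_acts):
    """Assumed estimator: unit vector from mean 'answered' to mean 'refused'."""
    mu_refused = mean_vector(refused_acts)
    mu_answered = mean_vector(answered_acts)
    return normalize([r - a for r, a in zip(mu_refused, mu_answered)])

def steer(activation, direction, strength=1.0):
    """Subtract `strength` times the projection of `activation` onto `direction`."""
    coeff = dot(activation, direction)
    return [x - strength * coeff * d for x, d in zip(activation, direction)]

# Toy activations: prompts that were refused vs. prompts that were answered.
refused = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
answered = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = refusal_direction(refused, answered)
steered = steer([3.0, 5.0, 1.0], direction)
```

With `strength=1.0` the steered vector has no remaining component along the estimated direction. In a real model the same arithmetic would be applied to hidden states at chosen layers, which this toy example does not attempt.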
Findings
Models exhibit significant underlying biases despite refusal responses.
Activation steering effectively uncovers hidden fairness issues.
The evaluation framework supports scalable and comprehensive bias assessment.
Abstract
Safety-aligned large language models (LLMs) are becoming increasingly widespread, especially in sensitive applications where fairness is essential and biased outputs can cause significant harm. However, evaluating the fairness of models is a complex challenge, and existing approaches typically rely on standard question-answer (QA) style schemes. Such methods often overlook deeper issues by interpreting the model's refusal responses as positive fairness measurements, which creates a false sense of fairness. In this work, we introduce the concept of silenced biases: unfair preferences that are encoded within models' latent space and effectively concealed by safety alignment. Previous approaches that considered similar indirect biases often relied on prompt manipulation or handcrafted implicit queries, which scale poorly and risk contaminating the evaluation process…
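To make the "false sense of fairness" concrete, here is a minimal sketch of how counting refusals as fair answers can inflate a score. The label set (`"refusal"`, `"biased"`, `"unbiased"`), the two scoring rules, and the toy data are assumptions for illustration. The excerpt does not specify the benchmark's actual metric.

```python
from collections import Counter

def fairness_scores(responses):
    """Return (naive, answered_only) fairness rates for a list of labels.

    Each label is assumed to be "refusal", "biased", or "unbiased".
    - naive: refusals are counted as fair responses.
    - answered_only: refusals are excluded from the denominator.
    """
    counts = Counter(responses)
    total = len(responses)
    if total == 0:
        raise ValueError("responses must be non-empty")
    answered = total - counts["refusal"]
    naive = (counts["unbiased"] + counts["refusal"]) / total
    # None signals that every response was a refusal, so nothing can be scored.
    answered_only = counts["unbiased"] / answered if answered else None
    return naive, answered_only

# Hypothetical model: refuses 8 of 10 prompts, answers 1 fairly and 1 unfairly.
labels = ["refusal"] * 8 + ["biased"] + ["unbiased"]
naive, answered_only = fairness_scores(labels)
print(naive, answered_only)
```

On this toy data the naive rule reports 0.9, while only half of the responses the model actually gave are unbiased. The gap between the two numbers is the kind of masking the abstract describes.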
Taxonomy
Topics: Ethics and Social Impacts of AI · Adversarial Robustness in Machine Learning · Topic Modeling
