Improving Fairness in LLMs Through Testing-Time Adversaries
Isabela Pereira Gregio, Ian Pons, Anna Helena Reali Costa, Artur Jordão

TL;DR
This paper introduces a testing-time adversarial method that creates sentence variations to detect and reduce biases in LLMs, significantly improving fairness metrics without retraining or fine-tuning.
Contribution
The proposed approach is a simple, practical, and training-free method that enhances fairness in LLMs by identifying biased predictions through sentence variation analysis.
Findings
Improves fairness metrics in Llama models by up to 27 percentage points.
Eliminates the need for training or fine-tuning, making it practical for real-world use.
Effectively reduces disparities across racial groups in model predictions.
Abstract
Large Language Models (LLMs) push the boundaries in natural language processing and generative AI, driving progress across various aspects of modern society. Unfortunately, the pervasive issue of bias in LLM responses (i.e., predictions) poses a significant and open challenge, hindering their application in tasks involving ethical sensitivity and responsible decision-making. In this work, we propose a straightforward, user-friendly, and practical method to mitigate such biases, enhancing the reliability and trustworthiness of LLMs. Our method creates multiple variations of a given sentence by modifying specific attributes and compares the model's prediction for each variation against its prediction for the original, unaltered sentence. The idea behind this process is that critical ethical predictions often exhibit notable inconsistencies, indicating the presence of bias. Unlike previous…
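The variation-and-compare idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the placeholder group labels (GroupA, GroupB, GroupC), the string-replacement way of building variations, and the majority vote used as a fallback answer are all assumptions made for illustration, and `toy_predict` is a deliberately biased stand-in for a real LLM call.

```python
from collections import Counter
from typing import Callable, Dict, List


def make_variations(sentence: str, attribute: str, alternatives: List[str]) -> List[str]:
    """Build variants of `sentence` by swapping one attribute term (assumed approach)."""
    return [sentence.replace(attribute, alt) for alt in alternatives]


def check_consistency(
    sentence: str,
    attribute: str,
    alternatives: List[str],
    predict: Callable[[str], str],
) -> Dict[str, object]:
    """Compare the prediction on the original sentence with predictions on its variants."""
    original_pred = predict(sentence)
    variant_preds = [predict(v) for v in make_variations(sentence, attribute, alternatives)]

    # A prediction that changes when only the attribute changes is treated as a bias signal.
    inconsistent = any(p != original_pred for p in variant_preds)

    # Assumption: fall back to the most common answer across all versions.
    # The source does not state how the final prediction is chosen.
    majority = Counter([original_pred] + variant_preds).most_common(1)[0][0]

    return {
        "original": original_pred,
        "variants": variant_preds,
        "inconsistent": inconsistent,
        "majority": majority,
    }


def toy_predict(text: str) -> str:
    """Hypothetical, intentionally biased classifier standing in for an LLM."""
    return "deny" if "GroupA" in text else "approve"


result = check_consistency(
    "The applicant from GroupA requested a loan.",
    attribute="GroupA",
    alternatives=["GroupB", "GroupC"],
    predict=toy_predict,
)
print(result["inconsistent"], result["majority"])  # prints: True approve
```

Because the check only calls `predict` on extra inputs, it needs no access to model weights, which is consistent with the training-free framing above. In practice `predict` would wrap a call to the model under test.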
Taxonomy
Topics: Ethics and Social Impacts of AI · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)
Methods: LLaMA
