Bias Testing and Mitigation in Black Box LLMs using Metamorphic Relations
Sina Salimian, Gias Uddin, Sumon Biswas, Henry Leung

TL;DR
This paper introduces a set of metamorphic relations to systematically detect and mitigate social biases in large language models, improving fairness and robustness through automated testing and targeted fine-tuning.
Contribution
It proposes six novel metamorphic relations for bias testing and mitigation, enabling automated bias detection and effective bias reduction in LLMs.
Findings
Detects up to 14% more hidden biases than existing tools.
Fine-tuning with MR-generated samples increases safe response rates from 54.7% to over 88.9%.
Applicable to both open-source and proprietary LLMs.
Abstract
The widespread deployment of Large Language Models (LLMs) has intensified concerns about subtle social biases embedded in their outputs. Existing guardrails often fail when faced with indirect or contextually complex bias-inducing prompts. To address these limitations, we propose a unified framework for both systematic bias evaluation and targeted mitigation. Our approach introduces six novel Metamorphic Relations (MRs) that, based on metamorphic testing principles, transform direct bias-inducing inputs into semantically equivalent yet adversarially challenging variants. These transformations enable an automated method for exposing hidden model biases: when an LLM responds inconsistently or unfairly across MR-generated variants, the underlying bias becomes detectable. We further show that the same MRs can be used to generate diverse bias-inducing samples for fine-tuning, directly…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHate Speech and Cyberbullying Detection · Topic Modeling · Adversarial Robustness in Machine Learning
