Bias Testing and Mitigation in Black Box LLMs using Metamorphic Relations

Sina Salimian; Gias Uddin; Sumon Biswas; Henry Leung

arXiv:2512.00556·cs.SE·December 2, 2025

Bias Testing and Mitigation in Black Box LLMs using Metamorphic Relations

Sina Salimian, Gias Uddin, Sumon Biswas, Henry Leung

PDF

Open Access

TL;DR

This paper introduces a set of metamorphic relations to systematically detect and mitigate social biases in large language models, improving fairness and robustness through automated testing and targeted fine-tuning.

Contribution

It proposes six novel metamorphic relations for bias testing and mitigation, enabling automated bias detection and effective bias reduction in LLMs.

Findings

01

Detects up to 14% more hidden biases than existing tools.

02

Fine-tuning with MR-generated samples increases safe response rates from 54.7% to over 88.9%.

03

Applicable to both open-source and proprietary LLMs.

Abstract

The widespread deployment of Large Language Models (LLMs) has intensified concerns about subtle social biases embedded in their outputs. Existing guardrails often fail when faced with indirect or contextually complex bias-inducing prompts. To address these limitations, we propose a unified framework for both systematic bias evaluation and targeted mitigation. Our approach introduces six novel Metamorphic Relations (MRs) that, based on metamorphic testing principles, transform direct bias-inducing inputs into semantically equivalent yet adversarially challenging variants. These transformations enable an automated method for exposing hidden model biases: when an LLM responds inconsistently or unfairly across MR-generated variants, the underlying bias becomes detectable. We further show that the same MRs can be used to generate diverse bias-inducing samples for fine-tuning, directly…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsHate Speech and Cyberbullying Detection · Topic Modeling · Adversarial Robustness in Machine Learning