Language Models That Walk the Talk: A Framework for Formal Fairness Certificates

Danqing Chen; Tobias Ladner; Ahmed Rayen Mhadhbi; Matthias Althoff

arXiv:2505.12767 · cs.AI · February 2, 2026

TL;DR

This paper introduces a framework for formally verifying the robustness of transformer-based language models, with a focus on gender fairness and toxicity detection, so that predictions stay consistent under small perturbations such as synonym substitutions.

Contribution

It develops a verification framework tailored to transformer-based language models, extending formal guarantees to fairness-critical tasks (gender bias mitigation) and safety-critical tasks (toxicity detection).

Findings

01 Certifies robustness of language models against adversarial perturbations.

02 Provides formal fairness guarantees in gender bias mitigation.

03 Ensures consistent toxicity detection under adversarial manipulation.

Abstract

As large language models become integral to high-stakes applications, ensuring their robustness and fairness is critical. Despite their success, large language models remain vulnerable to adversarial attacks, where small perturbations, such as synonym substitutions, can alter model predictions, posing risks in fairness-critical areas, such as gender bias mitigation, and safety-critical areas, such as toxicity detection. While formal verification has been explored for neural networks, its application to large language models remains limited. This work presents a holistic verification framework to certify the robustness of transformer-based language models, with a focus on ensuring gender fairness and consistent outputs across different gender-related terms. Furthermore, we extend this methodology to toxicity detection, offering formal guarantees that adversarially manipulated toxic…
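
To make the idea of certifying "consistent outputs across different gender-related terms" concrete, the following is a minimal sketch using interval bound propagation, a standard neural network verification technique. It is not taken from the paper, whose actual verification method is not described in this excerpt. The two-dimensional embeddings, the tiny ReLU network, and all names (`certify`, `LAYERS`, `TERM_EMBEDDINGS`) are hypothetical and chosen only for illustration. The sketch encloses the embeddings of the interchangeable terms in an axis-aligned box, pushes the box through the network, and reports a certificate only if the target logit's lower bound exceeds every other logit's upper bound.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Layer = Tuple[Matrix, Vector]  # (weights, bias) of one affine layer


def bounding_box(points: List[Vector]) -> Tuple[Vector, Vector]:
    """Smallest axis-aligned box that contains all given embeddings."""
    lo = [min(col) for col in zip(*points)]
    hi = [max(col) for col in zip(*points)]
    return lo, hi


def affine_bounds(W: Matrix, b: Vector, lo: Vector, hi: Vector) -> Tuple[Vector, Vector]:
    """Sound output bounds of x -> W x + b for every x in the box [lo, hi]."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        low, high = bias, bias
        for w, a, c in zip(row, lo, hi):
            if w >= 0:
                low += w * a
                high += w * c
            else:
                # A negative weight flips which end of the interval is extreme.
                low += w * c
                high += w * a
        out_lo.append(low)
        out_hi.append(high)
    return out_lo, out_hi


def relu_bounds(lo: Vector, hi: Vector) -> Tuple[Vector, Vector]:
    """ReLU is monotone, so it can be applied to both bounds directly."""
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]


def certify(layers: List[Layer], points: List[Vector], label: int) -> bool:
    """True if `label` provably wins for every input in the box around `points`.

    The check is sound but incomplete: False means "not certified",
    not "a counterexample exists".
    """
    lo, hi = bounding_box(points)
    for i, (W, b) in enumerate(layers):
        lo, hi = affine_bounds(W, b, lo, hi)
        if i < len(layers) - 1:  # ReLU after every layer except the last
            lo, hi = relu_bounds(lo, hi)
    return all(lo[label] > hi[j] for j in range(len(lo)) if j != label)


# Hypothetical embeddings of two gender-related terms (illustrative values).
TERM_EMBEDDINGS = [[1.0, 0.2], [1.0, -0.2]]

# Hypothetical two-layer classifier that barely uses the second dimension.
LAYERS = [
    ([[1.0, 0.1], [0.5, -0.1]], [0.0, 0.0]),
    ([[1.0, 1.0], [-1.0, 0.0]], [0.0, 0.5]),
]

# Hypothetical classifier that depends only on the second dimension.
SENSITIVE_LAYERS = [
    ([[0.0, 1.0], [0.0, -1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
]

if __name__ == "__main__":
    print(certify(LAYERS, TERM_EMBEDDINGS, label=0))            # prints True
    print(certify(SENSITIVE_LAYERS, TERM_EMBEDDINGS, label=0))  # prints False
```

The box is a deliberate over-approximation: it also covers points between and around the term embeddings, so a certificate for the box implies a certificate for each individual term. The price is looseness, since a model that is in fact consistent on the listed terms may still fail the check.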

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Ethics and Social Impacts of AI