Exposing the Illusion of Fairness: Auditing Vulnerabilities to Distributional Manipulation Attacks
Valentin Lafargue, Adriana Laurindo Monteiro, Emmanuelle Claeys, Laurent Risser, Jean-Michel Loubes

TL;DR
This paper reveals how malicious actors can craft samples that appear fair and representative to deceive fairness audits in AI systems, highlighting vulnerabilities and proposing detection strategies.
Contribution
It formalizes the problem of fairness manipulation as a distributional projection, introduces manipulation strategies, and evaluates detection methods to improve audit robustness.
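One generic way to write such a projection is given below, as a sketch only. The symbols are illustrative assumptions rather than the paper's notation: $P_n$ is the empirical distribution of the original data, $Q$ the distribution of the manipulated sample, $D$ an unspecified divergence or distance, $\hat{Y}$ the classifier's prediction, $S$ the protected attribute, and $\tau$ the fairness threshold.

```latex
Q^\star \in \arg\min_{Q} \; D(Q, P_n)
\quad \text{subject to} \quad
\mathrm{DI}(Q) = \frac{Q(\hat{Y}=1 \mid S=0)}{Q(\hat{Y}=1 \mid S=1)} \ge \tau
```

Read this way, the auditee looks for the distribution closest to the original one that still passes the fairness check, which is why the resulting shift can be small.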
Findings
Manipulation can satisfy fairness constraints with minimal distributional shift.
Statistical tests can detect some manipulated samples but may fail under certain conditions.
Guidelines are provided to strengthen verification against distributional manipulation.
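To make the detection finding concrete, here is a minimal sketch of a representativeness check on a single numeric feature. It is not the paper's test: the choice of a 1-D Wasserstein distance with a Monte Carlo null built from random subsets, the function names `wasserstein_1d` and `representativeness_p_value`, and the toy Gaussian data are all assumptions made for illustration.

```python
import bisect
import random


def wasserstein_1d(x, y):
    """Wasserstein-1 distance between two 1-D empirical distributions.

    Computed as the integral of |F_x - F_y| over the merged support.
    """
    xs, ys = sorted(x), sorted(y)
    grid = sorted(xs + ys)
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        f_x = bisect.bisect_right(xs, left) / len(xs)
        f_y = bisect.bisect_right(ys, left) / len(ys)
        total += abs(f_x - f_y) * (right - left)
    return total


def representativeness_p_value(sample, full, n_draws=200, seed=0):
    """Monte Carlo p-value: how often is a truly random subset of `full`
    at least as far from `full` as the submitted `sample`?"""
    rng = random.Random(seed)
    observed = wasserstein_1d(sample, full)
    exceed = 0
    for _ in range(n_draws):
        null_sample = rng.sample(full, len(sample))
        if wasserstein_1d(null_sample, full) >= observed:
            exceed += 1
    return (exceed + 1) / (n_draws + 1)


# Toy data (assumption): one numeric feature drawn from a standard normal.
rng = random.Random(42)
full = [rng.gauss(0.0, 1.0) for _ in range(1000)]
honest = rng.sample(full, 200)
# Blatant cherry-picking of the largest values, far cruder than the
# minimal-shift manipulations the paper studies.
manipulated = sorted(full)[-200:]

p_honest = representativeness_p_value(honest, full)
p_manipulated = representativeness_p_value(manipulated, full)
print(p_honest, p_manipulated)
```

A crude manipulation like this one is rejected easily. A manipulation built to stay close to the original distribution would yield a much smaller distance, which is the regime where such tests may fail.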
Abstract
The rapid deployment of AI systems in high-stakes domains, including those classified as high-risk under the EU AI Act (Regulation (EU) 2024/1689), has intensified the need for reliable compliance auditing. For binary classifiers, regulatory risk assessment often relies on global fairness metrics such as the Disparate Impact ratio, widely used to evaluate potential discrimination. In typical auditing settings, the auditee provides a subset of its dataset to an auditor, while a supervisory authority may verify whether this subset is representative of the full underlying distribution. In this work, we investigate to what extent a malicious auditee can construct a fairness-compliant yet representative-looking sample from a non-compliant original distribution, thereby creating an illusion of fairness. We formalize this problem as a constrained distributional projection task and…
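As an illustration of the metric named in the abstract, the sketch below computes the Disparate Impact ratio as the positive-prediction rate of the unprivileged group divided by that of the privileged group. The helper name `disparate_impact`, the encoding of records as `(s, y_hat)` pairs, the 0.8 threshold (the common four-fifths rule), and the toy counts are assumptions for illustration and are not taken from the paper.

```python
def disparate_impact(records):
    """Disparate Impact ratio for (s, y_hat) pairs.

    s = 0 marks the unprivileged group, s = 1 the privileged group;
    y_hat is the binary prediction (1 = favourable outcome).
    """
    positives = {0: 0, 1: 0}
    totals = {0: 0, 1: 0}
    for s, y_hat in records:
        totals[s] += 1
        positives[s] += y_hat
    if totals[0] == 0 or totals[1] == 0 or positives[1] == 0:
        raise ValueError("both groups and a privileged positive are required")
    rate_unprivileged = positives[0] / totals[0]
    rate_privileged = positives[1] / totals[1]
    return rate_unprivileged / rate_privileged


THRESHOLD = 0.8  # assumed compliance threshold (four-fifths rule)

# Toy original data: 30% positives for s=0, 60% positives for s=1.
full_data = [(0, 1)] * 30 + [(0, 0)] * 70 + [(1, 1)] * 60 + [(1, 0)] * 40

# Crude manipulation: silently drop 40 negative records of the unprivileged group.
cherry_picked = [(0, 1)] * 30 + [(0, 0)] * 30 + [(1, 1)] * 60 + [(1, 0)] * 40

for name, data in [("full", full_data), ("cherry-picked", cherry_picked)]:
    di = disparate_impact(data)
    print(f"{name}: DI = {di:.3f}, passes = {di >= THRESHOLD}")
```

The full data fails the assumed threshold while the cherry-picked subset passes it. Dropping records this bluntly changes the group proportions visibly; the paper's concern is manipulations that achieve the same effect while still looking representative.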
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
+ the problem of reliably auditing classifiers for fairness is timely and relevant
+ the proposed optimization problems make sense and can indeed be expected to result in hard-to-detect sampling distributions
+ source code and detailed appendices with formal derivations are provided
The manuscript has a number of weaknesses and aspects that are unclear to me.
* Motivation
The justification for the exact problem setting is unclear to me. The work adopts a setting where an audited entity (let's call it $E$) gives an auditor ($A$) access to a random subset of its data. A supervisor ($S$) is meant to check that this subset is representative. For this, $S$ gets access to $E$'s complete dataset. My concern is: if $E$ is potentially dishonest and manipulates the subset it give
1. The paper studies an important topic.
2. For the experiments, the authors considered a wide range of datasets.
1. The majority of the focus and experiments are on Disparate Impact. It would be nice if the authors expanded their findings and results to other fairness metrics as well.
2. The majority of the focus and experiments are on tabular data. While the authors briefly discuss image data and report some results in the appendix, it would be good to extend these discussions and experiments and bring them into the main body of the paper. This would strengthen the paper and improve its scope and impact.
3. It would be
- The paper addresses a critical problem. Exposing the vulnerabilities of the audits before they are widely deployed is significant.
- The authors formalize the "fair-washing" attack and model it as a constrained optimization problem.
- The paper's idea is a three-party framework where the "supervisory authority" has access to the auditee's complete, original dataset. This is a very strong assumption that feels contradictory. If the authority already has the full, true dataset, it is unclear why they would bother analyzing a sample provided by the auditee.
- The proposed defense relies on tests like the Wasserstein distance and MMD. These tests are difficult and expensive to compute in high dimensions.
- The paper's methodo
