CAFP: A Post-Processing Framework for Group Fairness via Counterfactual Model Averaging
Irina Arévalo, Marcos Oliva

TL;DR
CAFP is a post-processing framework that enhances fairness in machine learning predictions by averaging model outputs over counterfactual inputs with flipped sensitive attributes, without retraining.
Contribution
It introduces a model-agnostic post-processing method that provably reduces violations of fairness criteria such as demographic parity and equalized odds, without modifying the original classifier.
Findings
CAFP eliminates direct dependence on protected attributes.
It reduces mutual information between predictions and sensitive attributes.
Under mild assumptions, it achieves perfect demographic parity.
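The first and third findings can be illustrated with a short derivation. This is a sketch only: it assumes a binary sensitive attribute a in {0, 1}, a score function f(x, a), and, as one sufficient condition that is not necessarily the paper's own "mild assumption", independence of the non-sensitive features X from A. The symbols \tilde{f} and g are introduced here for illustration.

```latex
\begin{align*}
\tilde{f}(x, a)
  &= \tfrac{1}{2}\bigl[f(x, a) + f(x, 1 - a)\bigr]
   = \tfrac{1}{2}\bigl[f(x, 0) + f(x, 1)\bigr]
   =: g(x), \\
\tilde{f}(x, 0) &= \tilde{f}(x, 1)
  \quad \text{(no direct dependence on } a\text{)}, \\
X \perp A \;&\Longrightarrow\;
  \Pr\bigl[g(X) \le t \mid A = 0\bigr]
  = \Pr\bigl[g(X) \le t \mid A = 1\bigr]
  \quad \text{for all } t .
\end{align*}
```

When X is correlated with A, the averaged score can still vary across groups through X, so only the direct dependence is removed in general.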
Abstract
Ensuring fairness in machine learning predictions is a critical challenge, especially when models are deployed in sensitive domains such as credit scoring, healthcare, and criminal justice. While many fairness interventions rely on data preprocessing or algorithmic constraints during training, these approaches often require full control over the model architecture and access to protected attribute information, which may not be feasible in real-world systems. In this paper, we propose Counterfactual Averaging for Fair Predictions (CAFP), a model-agnostic post-processing method that mitigates unfair influence from protected attributes without retraining or modifying the original classifier. CAFP operates by generating counterfactual versions of each input in which the sensitive attribute is flipped, and then averaging the model's predictions across factual and counterfactual instances. We…
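The averaging step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a binary sensitive attribute encoded as 0/1 in a single column, and the names `cafp_predict`, `predict_fn`, `sensitive_idx`, and `toy_model` are hypothetical.

```python
import numpy as np

def cafp_predict(predict_fn, X, sensitive_idx):
    """Average a black-box model's scores over factual and flipped inputs.

    Assumes the sensitive attribute is binary (0/1) and stored in column
    `sensitive_idx`; `predict_fn` maps an (n, d) array to n scores.
    """
    X = np.asarray(X, dtype=float)
    X_cf = X.copy()
    # Build the counterfactual input by flipping the sensitive attribute.
    X_cf[:, sensitive_idx] = 1.0 - X_cf[:, sensitive_idx]
    # Equal-weight average of factual and counterfactual predictions.
    return 0.5 * (predict_fn(X) + predict_fn(X_cf))

def toy_model(X):
    # Hypothetical biased scorer: column 1 (the sensitive attribute)
    # directly raises the score.
    return 1.0 / (1.0 + np.exp(-(X[:, 0] + 2.0 * X[:, 1])))

# Two individuals who differ only in the sensitive attribute.
X = np.array([[0.5, 0.0],
              [0.5, 1.0]])
raw_scores = toy_model(X)
fair_scores = cafp_predict(toy_model, X, sensitive_idx=1)
```

In this toy setup the raw scores differ between the two rows, while the averaged scores coincide, because the averaged output depends only on the non-sensitive column. The original model is called as a black box and is never retrained.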
