Training on Plausible Counterfactuals Removes Spurious Correlations

Shpresim Sadiku; Kartikeya Chitranshi; Hiroshi Kera; Sebastian Pokutta

arXiv:2505.16583·cs.LG·November 14, 2025

TL;DR

Training classifiers on plausible counterfactual explanations labeled with incorrect classes can reduce reliance on spurious correlations, improving robustness and fairness.

Contribution

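This work extends the paradigm of learning from adversarial perturbations labeled with incorrect classes to plausible counterfactual explanations (p-CFEs), and shows that the resulting classifiers combine high in-distribution accuracy with reduced bias from spurious correlations.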

Findings

01. Classifiers trained on p-CFEs achieve high in-distribution accuracy.

02. Training on p-CFEs reduces reliance on spurious correlations.

03. Learning from p-CFEs is more effective than from adversarial perturbations.

Abstract

Plausible counterfactual explanations (p-CFEs) are perturbations that minimally modify inputs to change classifier decisions while remaining plausible under the data distribution. In this study, we demonstrate that classifiers can be trained on p-CFEs labeled with induced incorrect target classes to classify unperturbed inputs with the original labels. While previous studies have shown that such learning is possible with adversarial perturbations, we extend this paradigm to p-CFEs. Interestingly, our experiments reveal that learning from p-CFEs is even more effective: the resulting classifiers not only achieve high in-distribution accuracy but also exhibit significantly reduced bias with respect to spurious correlations.
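
The data construction described in the abstract (perturb each input toward an incorrect target class, then label the perturbed input with that target) can be illustrated with a minimal sketch. This is not the authors' implementation: `build_relabeled_dataset` and `toy_perturb` are hypothetical names, and `toy_perturb` is a placeholder shift that does not enforce plausibility; an actual p-CFE generator would take its place.

```python
import numpy as np

def build_relabeled_dataset(X, y, num_classes, perturb_fn, rng):
    """Pair each perturbed input with the (incorrect) target class it was pushed toward."""
    # Draw an offset in [1, num_classes - 1] so every target differs from the true label.
    offsets = rng.integers(1, num_classes, size=len(y))
    targets = (y + offsets) % num_classes
    # perturb_fn is a hypothetical stand-in for a p-CFE generator.
    X_cf = np.stack([perturb_fn(x, t) for x, t in zip(X, targets)])
    return X_cf, targets

def toy_perturb(x, target):
    # Placeholder only: a fixed shift, NOT a plausible counterfactual.
    return x + 0.1 * (2 * target - 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))          # toy unperturbed inputs
y = np.array([0, 1, 0, 1, 0, 1])     # original (correct) labels

X_cf, targets = build_relabeled_dataset(X, y, 2, toy_perturb, rng)
```

Under the setup the abstract describes, a classifier would be trained on `(X_cf, targets)` and then evaluated on the unperturbed pairs `(X, y)`.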

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis