Towards Real-world Debiasing: Rethinking Evaluation, Challenge, and Solution
Peng Kuang, Zhibo Wang, Zhixuan Chu, Jingyi Wang, Kui Ren

TL;DR
This paper critically examines real-world biases in machine learning, proposing a systematic evaluation framework, new bias types, and a novel method to improve debiasing without bias labels, validated across multiple datasets.
Contribution
It introduces a fine-grained analysis of real-world biased distributions, proposes new biases and a systematic evaluation framework, and presents the DiD method for bias mitigation without bias labels.
Findings
Existing benchmarks poorly represent real-world biases.
Two new real-world-inspired biases are proposed.
The DiD method effectively reduces bias across 8 datasets.
Abstract
Spurious correlations in training data significantly hinder the generalization capability of machine learning models when faced with distribution shifts, which has prompted the proposal of numerous debiasing methods. However, it remains to be asked: Do existing benchmarks for debiasing really represent biases in the real world? Recent works attempt to address such concerns by sampling from real-world data (instead of synthesizing) according to some predefined biased distributions to ensure the realism of individual samples. However, the realism of the biased distribution is more critical yet challenging and underexplored due to the complexity of real-world bias distributions. To tackle the problem, we propose a fine-grained framework for analyzing biased distributions, based on which we empirically and theoretically identify key characteristics of biased distributions in the real…
Peer Reviews
Decision: Submitted to ICLR 2026
Strengths are summarized as follows: (1) Clear problem framing: Figure 1 contrasts diagonal benchmark patterns with scattered real-world biases. (2) Rigorous theory: Propositions 1 and 2 justify the sparsity of real-world biases mathematically. (3) Comprehensive experiments: Covers 8 datasets (vision + NLP benchmarks), 9 baselines (e.g., LfF, DisEnt, BEL), multiple bias types.
Critical weaknesses: (1) Limited real-world evidence: Detailed analysis is provided only for the COCO and COMPAS datasets. CelebA, MultiNLI, and CCW are minimally discussed, appearing mainly in Figure 2 and the Appendix. CelebA is real-world, but its coverage is limited; medical and social media domains are mentioned only as motivation. Core experiments likely rely on synthetic datasets (Colored MNIST and Corrupted CIFAR-10). (2) Experimental design flaw: LMLP with threshold 0 results in 0% BN samples, which contra
- Novel problem framing and fine-grained bias analysis: The paper raises an important question about whether current benchmarks truly represent real-world biases. By distinguishing bias magnitude (how strong the spurious correlation is) and bias prevalence (how common it is in the dataset), the authors provide a meaningful and interpretable framework for bias characterization. This fine-grained view can serve as a strong foundation for future benchmark design. - Realistic motivation for bias-ag
- Lack of validation on real-world datasets (MS COCO, COMPAS): Although the introduction emphasizes real-world bias distributions and repeatedly mentions datasets such as MS COCO (for vision) and COMPAS (for fairness in tabular domains), the actual experiments are limited to synthetic or semi-synthetic settings such as Colored MNIST or Corrupted CIFAR-10 (referred to as HMLP BC). The absence of evaluation on these real datasets undermines the claim that DiD or RDBench effectively handles real-wor
(1) Presents a comprehensive empirical and theoretical analysis of real-world bias distributions, introducing the RDBench framework that provides a systematic and realistic benchmark for evaluating debiasing methods. (2) Proposes a simple yet effective Debias-in-Destruction (DiD) approach that generalizes well across multiple datasets and modalities, demonstrating strong improvements over existing debiasing methods.
(1) The clarity of theoretical exposition could be improved, especially regarding assumptions and proofs. (2) Evaluation on large-scale, high-dimensional real-world data (e.g., complex vision-language models) remains limited. (3) The DiD method’s simplicity, while appealing, may lack interpretability and deeper theoretical grounding. (4) Some parts of the framework (e.g., threshold selection for bias magnitude/prevalence) rely on heuristics rather than principled estimation.
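The distinction the reviews draw between bias magnitude and bias prevalence can be made concrete with a small sketch. This is an illustration only, not the paper's definition. It assumes that magnitude is measured per attribute value as the KL divergence between P(y | a) and P(y), and that prevalence is the fraction of samples whose attribute value exceeds a hand-picked magnitude threshold. The function name `bias_profile`, the threshold of 0.1, and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def bias_profile(attrs, labels, threshold=0.1):
    """Return (magnitude per attribute value, prevalence).

    Assumed definitions (illustrative, not taken from the paper):
      magnitude[a] = KL( P(y | a) || P(y) ), in nats
      prevalence   = share of samples whose attribute value has
                     magnitude above `threshold`
    """
    n = len(labels)
    p_y = {y: c / n for y, c in Counter(labels).items()}

    # Count labels separately for each attribute value.
    by_attr = defaultdict(Counter)
    for a, y in zip(attrs, labels):
        by_attr[a][y] += 1

    magnitude = {}
    for a, counts in by_attr.items():
        total = sum(counts.values())
        magnitude[a] = sum(
            (c / total) * math.log((c / total) / p_y[y])
            for y, c in counts.items()
        )

    biased_values = {a for a, m in magnitude.items() if m > threshold}
    prevalence = sum(1 for a in attrs if a in biased_values) / n
    return magnitude, prevalence


# Toy data: "red" is rare but perfectly predictive of label 1,
# while "gray" is common and uninformative.
attrs = ["red"] * 2 + ["gray"] * 8
labels = [1, 1] + [0, 1] * 4

magnitude, prevalence = bias_profile(attrs, labels)
print(magnitude, prevalence)
```

On this toy data only "red" exceeds the threshold, so the prevalence is 0.2: a strong spurious correlation that touches only a small share of the samples. This mirrors the sparse, scattered pattern the reviews attribute to real-world biases, in contrast to benchmarks where nearly every sample carries a biased feature. The choice of threshold here is arbitrary, which is the kind of heuristic the last review flags.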
Taxonomy
Topics: Machine Learning and Data Classification
