Benchmarking Bias Mitigation Toward Fairness Without Harm from Vision to LVLMs
Xuwei Tan, Ziyu Hu, Xueru Zhang

TL;DR
This paper introduces NH-Fair, a comprehensive benchmark for evaluating bias mitigation methods in vision and vision-language models, emphasizing standardized evaluation, hyperparameter tuning, and practical fairness strategies.
Contribution
It presents a unified benchmarking framework, systematic hyperparameter tuning insights, and evidence that data augmentation outperforms many debiasing methods for fairness without utility loss.
Findings
Many debiasing methods do not outperform a well-tuned ERM baseline.
Data augmentation consistently improves fairness without sacrificing utility.
LVLMs have higher accuracy but still exhibit subgroup disparities.
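The subgroup disparities named in the last finding can be made concrete with a small sketch. It is illustrative only: the function names, the toy labels, and the use of a max-minus-min accuracy gap as the disparity measure are assumptions, not the metrics NH-Fair defines.

```python
from collections import defaultdict


def subgroup_accuracy(y_true, y_pred, groups):
    """Return accuracy computed separately for each sensitive-attribute group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


def accuracy_gap(per_group):
    """Disparity as best-group accuracy minus worst-group accuracy (an assumed metric)."""
    return max(per_group.values()) - min(per_group.values())


# Toy data, made up for illustration: two groups "a" and "b".
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

per_group = subgroup_accuracy(y_true, y_pred, groups)
overall = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)

print(per_group)                # accuracy per group
print(accuracy_gap(per_group))  # gap between best and worst group
print(overall)                  # overall accuracy, which hides the gap
```

A single overall accuracy number can look acceptable while one group is served noticeably worse, which is why a per-group breakdown is reported alongside utility.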
Abstract
Machine learning models trained on real-world data often inherit and amplify biases against certain social groups, raising urgent concerns about their deployment at scale. While numerous bias mitigation methods have been proposed, comparing their effectiveness remains difficult due to heterogeneous datasets, inconsistent fairness metrics, isolated evaluation of vision versus multi-modal models, and insufficient hyperparameter tuning that undermines fair comparisons. We introduce NH-Fair, a unified benchmark for fairness without harm that spans both vision models and large vision-language models (LVLMs) under standardized data, metrics, and training protocols, covering supervised and zero-shot regimes. Our key contributions are: (1) a systematic ERM tuning study that identifies training choices with large influence on both utility and disparities, yielding…
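The "fairness without harm" idea in the abstract can be read as a selection rule: prefer a debiasing method only if it reduces disparity without lowering utility relative to the ERM baseline. The sketch below encodes that reading. It is an assumption, not NH-Fair's actual protocol: the `Candidate` class, the `select_without_harm` function, the `tol` parameter, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float  # utility on validation data (higher is better)
    gap: float       # disparity measure (lower is fairer)


def select_without_harm(candidates, baseline, tol=0.0):
    """Pick the fairest candidate that does not fall below the baseline's utility.

    Falls back to the baseline when no candidate is both harmless
    (accuracy >= baseline.accuracy - tol) and strictly fairer.
    """
    eligible = [c for c in candidates if c.accuracy >= baseline.accuracy - tol]
    if not eligible:
        return baseline
    best = min(eligible, key=lambda c: c.gap)
    return best if best.gap < baseline.gap else baseline


# Made-up numbers for illustration only; these are not results from the paper.
erm = Candidate("erm", accuracy=0.90, gap=0.10)
methods = [
    Candidate("method_a", accuracy=0.85, gap=0.02),  # fairer, but harms utility
    Candidate("method_b", accuracy=0.91, gap=0.06),  # fairer and harmless
    Candidate("method_c", accuracy=0.90, gap=0.08),  # harmless, slightly fairer
]

chosen = select_without_harm(methods, erm)
print(chosen.name)
```

Under this rule `method_a` is excluded despite having the smallest gap, because its accuracy falls below the baseline. A nonzero `tol` would relax the constraint to allow a small utility loss.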
Peer Reviews
Decision · ICLR 2026 Poster
- This work addresses several shortcomings of fairness evaluation in prior work, such as inconsistent hyperparameter selection, overlooked utility performance, and the omission of pre-trained foundation models.
- The evaluation framework is systematic, consisting of multiple datasets and domains, principled training and model selection, and numerous utility and bias metrics. This is crucial to ensure fair comparison between methods as well as reproducibility.
- Extensive ablation experiments on tra
- The proposed benchmark is limited to the classification task. Despite being reformulated for image-text matching (CLIP) and generative (LVLM) models, it does not truly address utility and fairness beyond closed-set predictions. I would have liked to see open-set tasks like free-form image-text retrieval and open-ended VQA, captioning or reasoning, as these are the tasks where vision(-language) foundation models truly overtake task-specific vision models. This would also enable the holistic evaluat
S1: The authors conduct a systematic hyper-parameter optimisation and model selection protocol, a step which is often overlooked in bias mitigation benchmarks.
S2: The authors present some unique analyses which are not usually discussed in bias mitigation papers, for instance on the impact of hyper-parameter choice on fairness metrics (e.g., choice of optimiser appears more impactful than choice of weight decay), whether pre-training is done (I liked the plots in the Appendix Fig 5), and the si
**Major weaknesses**
W1: My primary criticism of the paper is that I am not sure the field needs another benchmark suggesting that overall ERM performs better/more reliably than existing mitigation methods (I would say there is already a broad consensus on this). I would argue that actually, instead of having an aggregate benchmark where many methods/models are compared across different datasets and metrics, it would make more sense to do a tailored analysis looking at which mitigation
1. Practical Problem: The "Fairness Without Harm" (FWH) principle is highly relevant for real-world deployment (e.g., healthcare).
2. Strong Baseline: The paper's key finding, that a well-tuned ERM can beat specialized algorithms, is crucial for the field and is backed by extensive HPO.
3. Rigorous Protocol: The DTO/FWH model selection strategy provides a novel and fair method for comparing models.
4. Timely LVLM Analysis: The paper correctly identifies that LVLMs are not a panacea fo
1. Limited LVLM Scope: The LVLM evaluation is zero-shot only. This is a major gap, as models are typically fine-tuned, a process which could significantly alter fairness outcomes.
2. No Intersectional Analysis: The benchmark only considers single sensitive attributes, ignoring intersectional biases (e.g., race and gender), which can be more severe.
3. Lacks Deep Insight: The paper is excellent at observing (e.g., "RandAug works well", "Optimizers matter"), but provides little analysis as to why
Taxonomy
Topics: Ethics and Social Impacts of AI · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)
