Whose Preferences? Differences in Fairness Preferences and Their Impact on the Fairness of AI Utilizing Human Feedback
Emilia Agis Lerner, Florian E. Dorner, Elliott Ash, Naman Goel

TL;DR
This paper investigates how demographic differences among annotators influence fairness preferences in content moderation and demonstrates that ensemble classifiers trained on diverse demographic annotations improve fairness outcomes.
Contribution
It introduces a novel dataset on fairness preferences across demographics and shows that ensemble classifiers leveraging diverse annotations enhance fairness in AI moderation.
Findings
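Fairness preferences differ significantly with annotators' race, age, political stance, educational level, and LGBTQ+ identity.
Demographics mentioned in the text strongly influence how users perceive individual fairness in moderation.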
Ensemble classifiers outperform single classifiers across demographic groups.
Abstract
There is a growing body of work on learning from human feedback to align various aspects of machine learning systems with human values and preferences. We consider the setting of fairness in content moderation, in which human feedback is used to determine how two comments -- referencing different sensitive attribute groups -- should be treated in comparison to one another. With a novel dataset collected from Prolific and MTurk, we find significant gaps in fairness preferences depending on the race, age, political stance, educational level, and LGBTQ+ identity of annotators. We also demonstrate that demographics mentioned in text have a strong influence on how users perceive individual fairness in moderation. Further, we find that differences also exist in downstream classifiers trained to predict human preferences. Finally, we observe that an ensemble, giving equal weight to classifiers…
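The equal-weight ensemble mentioned at the end of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each group-specific classifier outputs one probability per example for some preference label, and that the ensemble averages those probabilities. The group names, scores, helper names, and the 0.5 threshold are hypothetical.

```python
from statistics import fmean


def equal_weight_ensemble(probs_by_group):
    """Average per-example probabilities across group-specific classifiers.

    probs_by_group maps a (hypothetical) annotator-group name to a list of
    predicted probabilities, one per example, all of the same length.
    """
    groups = list(probs_by_group.values())
    if not groups:
        raise ValueError("need at least one group classifier")
    n_examples = len(groups[0])
    if any(len(g) != n_examples for g in groups):
        raise ValueError("all classifiers must score the same examples")
    # Every classifier gets weight 1/len(groups), regardless of how many
    # annotators contributed to its training data.
    return [fmean(column) for column in zip(*groups)]


def decide(probs, threshold=0.5):
    """Turn averaged probabilities into binary decisions (assumed threshold)."""
    return [p >= threshold for p in probs]


# Made-up scores for three examples from two hypothetical group classifiers.
probs_by_group = {
    "group_a": [0.9, 0.2, 0.6],
    "group_b": [0.7, 0.4, 0.2],
}
ensemble = equal_weight_ensemble(probs_by_group)
decisions = decide(ensemble)
```

With equal weights, a group with fewer annotators influences the final decision as much as a larger one. On the third example the two classifiers disagree (0.6 versus 0.2), and the averaged score falls below the assumed threshold.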
Taxonomy
Topics: Ethics and Social Impacts of AI
Methods: ALIGN
