Designing Evaluations of Machine Learning Models for Subjective Inference: The Case of Sentence Toxicity
Agathe Balayn, Alessandro Bozzon

TL;DR
This paper argues that machine learning models inferring subjective properties, such as sentence toxicity, should be evaluated with subjectivity and bias in mind, and proposes an initial list of specifications to guide the creation of evaluation datasets.
Contribution
It introduces a set of specifications for evaluating biases in ML models on subjective tasks, exemplified through sentence toxicity inference.
Findings
Proposes specifications for bias evaluation datasets
Highlights challenges in instantiating these specifications
Suggests future work for crowdsourcing dataset creation
Abstract
Machine Learning (ML) is increasingly applied in real-life scenarios, raising concerns about bias in automatic decision making. We focus on bias as a notion of opinion exclusion, which stems from the direct application of traditional ML pipelines to infer subjective properties. We argue that such ML systems should be evaluated with subjectivity and bias in mind. Given the current lack of standards for creating evaluation benchmarks, we propose an initial list of specifications to define before creating evaluation datasets, so that biases can later be evaluated accurately. With the example of a sentence toxicity inference system, we illustrate how the specifications support the analysis of biases related to subjectivity. We highlight difficulties in instantiating these specifications and list future work for the crowdsourcing community to help the creation of appropriate…
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Adversarial Robustness in Machine Learning · Privacy-Preserving Technologies in Data
