Explainable Agreement through Simulation for Tasks with Subjective Labels
John Foley

TL;DR
This paper introduces a simulation-based approach to evaluate the agreement and inherent subjectivity in datasets for tasks like relevance and controversy detection, highlighting the limits of current datasets and the need for more data.
Contribution
It proposes using user simulation to measure the maximum achievable scores given dataset noise and subjectivity, providing a new way to interpret classifier performance.
Findings
Simulating truth and predictions reveals the maximum scores a dataset can support.
A commonly used controversy-detection dataset appears to be exhausted; further progress needs more data.
Current measures do not account for label subjectivity.
Abstract
The field of information retrieval often works with limited and noisy data in an attempt to classify documents into subjective categories, e.g., relevance, sentiment and controversy. We typically quantify a notion of agreement to understand the difficulty of the labeling task, but when we present final results, we do so using measures that are unaware of agreement or the inherent subjectivity of the task. We propose using user simulation to understand the effect size of this noisy agreement data. By simulating truth and predictions, we can understand the maximum scores a dataset can support: for if a classifier is doing better than a reasonable model of a human, we cannot conclude that it is actually better, but that it may be learning noise present in the dataset. We present a brief case study on controversy detection that concludes that a commonly-used dataset has been exhausted: in…
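To make the idea of "simulating truth and predictions" concrete, here is a minimal sketch in Python. It is not the paper's procedure: it assumes binary labels, that each item's annotator votes define a probability of the positive class, and that both the simulated truth and the simulated human are independent draws from that probability. The function name `simulate_human_ceiling` and the toy vote counts are hypothetical.

```python
import random

def simulate_human_ceiling(votes, trials=1000, seed=0):
    """Estimate the accuracy a simulated human reaches against simulated truth.

    votes: list of (positive_votes, total_votes) per item (assumed input format).
    """
    rng = random.Random(seed)
    # Assumption: the vote fraction is the probability that an item is positive.
    probs = [pos / total for pos, total in votes]
    scores = []
    for _ in range(trials):
        # Draw one plausible ground truth and one plausible human labeling.
        truth = [rng.random() < p for p in probs]
        human = [rng.random() < p for p in probs]
        correct = sum(t == h for t, h in zip(truth, human))
        scores.append(correct / len(probs))
    scores.sort()
    return {
        "mean": sum(scores) / trials,
        "p05": scores[int(0.05 * trials)],
        "p95": scores[int(0.95 * trials)],
    }

# Hypothetical toy data: four unanimous items and two items with a 2-vs-1 split.
toy_votes = [(3, 3), (0, 3), (2, 3), (1, 3), (3, 3), (0, 3)]
ceiling = simulate_human_ceiling(toy_votes)
print(ceiling)
```

Under these assumptions the expected agreement on the toy data is (4 + 2 · 5/9) / 6 ≈ 0.85, so a classifier reporting accuracy well above that on such labels is more plausibly fitting annotation noise than outperforming a human. The same loop could score any other measure, such as F1 or AUC, in place of accuracy.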
Taxonomy
Topics: Business Process Modeling and Analysis · Safety Systems Engineering in Autonomy · Multi-Agent Systems and Negotiation
