Perturbation Sensitivity Analysis to Detect Unintended Model Biases
Vinodkumar Prabhakaran, Ben Hutchinson, Margaret Mitchell

TL;DR
This paper introduces Perturbation Sensitivity Analysis, a generic evaluation framework for detecting unintended biases in NLP models related to named entities. It tests whether model outputs stay independent of the identity of the entities mentioned in the text and their social associations, and it requires no additional annotations.
Contribution
The paper proposes a novel, annotation-free evaluation method for identifying entity-related biases in NLP models, demonstrated on sentiment and toxicity tasks across multiple genres.
Findings
The framework successfully detects biases in sentiment and toxicity models.
It works across different NLP tasks and genres.
No additional annotated data is needed for bias detection.
Abstract
Data-driven statistical Natural Language Processing (NLP) techniques leverage large amounts of language data to build models that can understand language. However, most language data reflect the public discourse at the time the data was produced, and hence NLP models are susceptible to learning incidental associations around named referents at a particular point in time, in addition to general linguistic meaning. An NLP system designed to model notions such as sentiment and toxicity should ideally produce scores that are independent of the identity of such entities mentioned in text and their social associations. For example, in a general purpose sentiment analysis system, a phrase such as I hate Katy Perry should be interpreted as having the same sentiment as I hate Taylor Swift. Based on this idea, we propose a generic evaluation framework, Perturbation Sensitivity Analysis, which…
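The idea in the abstract's example can be illustrated with a minimal sketch: fill one sentence template with different names, score each variant, and look at how much the scores move. Everything named here is an assumption made for illustration: `perturb`, `toy_score`, and `sensitivity` are hypothetical helpers, the toy scorer is a hand-written stand-in for a real sentiment model with a name association planted on purpose, and the range and standard deviation are simple summary statistics chosen for this sketch, not necessarily the metrics the paper defines.

```python
from statistics import pstdev


def perturb(template, names, slot="{name}"):
    """Fill the slot in the template with each name in turn."""
    return [template.replace(slot, name) for name in names]


def toy_score(text):
    """Hypothetical stand-in for a sentiment model (not a real system).

    It scores "hate" as negative and carries a deliberate, unwanted
    association with one name so that the analysis has something to find.
    """
    word_weights = {"hate": -1.0}
    name_bias = {"Taylor Swift": 0.2}  # planted bias, for illustration only
    score = sum(w for word, w in word_weights.items() if word in text)
    score += sum(b for name, b in name_bias.items() if name in text)
    return score


def sensitivity(template, names, score_fn):
    """Score every perturbed sentence and summarize how much scores vary."""
    scores = [score_fn(sentence) for sentence in perturb(template, names)]
    return {
        "scores": dict(zip(names, scores)),
        "range": max(scores) - min(scores),  # 0 means the name made no difference
        "std": pstdev(scores),
    }


names = ["Katy Perry", "Taylor Swift"]
report = sensitivity("I hate {name}", names, toy_score)
print(report)
```

A scorer that ignores the name yields a range of zero for every template; a nonzero range, as with the planted bias here, flags that the score depends on which entity is mentioned. No gold labels are involved, since only the model's own outputs are compared with each other.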
