Social Biases in Automatic Evaluation Metrics for NLG
Mingqi Gao, Xiaojun Wan

TL;DR
This paper investigates social biases, especially gender bias, in automatic evaluation metrics for natural language generation (NLG). It finds that some model-based metrics carry such biases and that their agreement with human judgments changes when the evaluation data are gender-swapped.
Contribution
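The paper proposes an evaluation method based on the Word Embeddings Association Test (WEAT) and the Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics. It also constructs gender-swapped meta-evaluation datasets for image captioning and text summarization to study how gender bias affects evaluation.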
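For readers unfamiliar with WEAT, the sketch below computes the standard WEAT effect size from the original WEAT formulation. It is only an illustration: the toy 2-D vectors and all function names are assumptions, and this page does not say how the paper extracts embeddings from each metric, so that step is left out.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """s(w, A, B): mean similarity of w to attribute set A minus that to B."""
    return mean(cosine(w, a) for a in attrs_a) - mean(cosine(w, b) for b in attrs_b)


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference of mean associations of X and Y, divided by the sample
    standard deviation of the associations over X and Y together."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Toy embeddings (hypothetical): X leans toward attribute A, Y toward B.
X = [[1.0, 0.0], [0.9, 0.1]]
Y = [[0.0, 1.0], [0.1, 0.9]]
A = [[1.0, 0.0]]
B = [[0.0, 1.0]]

d = weat_effect_size(X, Y, A, B)
print(f"effect size: {d:.3f}")
```

A positive value means the X targets sit closer to the A attributes than the Y targets do. SEAT applies the same statistic to sentence embeddings instead of word embeddings.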
Findings
Social biases are widely present in some model-based evaluation metrics.
Gender swapping affects the correlation between metrics and human judgments.
Given gender-neutral references, model-based metrics may prefer the male hypothesis.
Abstract
Many studies have revealed that word embeddings, language models, and models for specific downstream tasks in NLP are prone to social biases, especially gender bias. Recently, these techniques have been gradually applied to automatic evaluation metrics for text generation. In this paper, we propose an evaluation method based on the Word Embeddings Association Test (WEAT) and the Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics, and discover that social biases are also widely present in some model-based automatic evaluation metrics. Moreover, we construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image captioning and text summarization tasks. Results show that given gender-neutral references in the evaluation, model-based evaluation metrics may show a preference for the male hypothesis, and the performance of…
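The gender-swapping step mentioned in the abstract can be pictured with a minimal word-substitution sketch. This is not the authors' construction procedure, which this page does not describe: the word list and the `swap_gender` helper are hypothetical, and ambiguous forms such as "her" (which can map to "his" or "him") are deliberately left out.

```python
import re

# Hypothetical, deliberately small list of unambiguous gendered pairs.
_PAIRS = {
    "man": "woman",
    "men": "women",
    "boy": "girl",
    "boys": "girls",
    "he": "she",
    "father": "mother",
    "son": "daughter",
}
# Make the mapping bidirectional so swapping twice restores the text.
SWAP = {**_PAIRS, **{v: k for k, v in _PAIRS.items()}}


def swap_gender(text):
    """Replace each listed gendered word with its counterpart."""

    def repl(match):
        word = match.group(0)
        swapped = SWAP.get(word.lower())
        if swapped is None:
            return word  # not a listed gendered word; leave unchanged
        # Preserve a leading capital letter (e.g. sentence-initial "He").
        return swapped.capitalize() if word[0].isupper() else swapped

    # Match whole alphabetic words so "man" inside "woman" is not touched.
    return re.sub(r"[A-Za-z]+", repl, text)


caption = "A man and a boy ride a horse."
print(swap_gender(caption))
```

Scoring a metric on both the original and the swapped text, against the same human judgments, is one way to see whether the metric's behavior depends on gendered words. A real dataset would need a fuller lexicon and handling of names and pronoun ambiguity.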
Code & Models
- 🤗 aieng-lab/bert-base-cased-gradiend-gender-debiased (model)
- 🤗 aieng-lab/bert-large-cased-gradiend-gender-debiased (model, 6 downloads)
- 🤗 aieng-lab/distilbert-base-cased-gradiend-gender-debiased (model, 6 downloads)
- 🤗 aieng-lab/roberta-large-gradiend-gender-debiased (model, 4 downloads)
- 🤗 aieng-lab/gpt2-gradiend-gender-debiased (model, 3 downloads)
- 🤗 aieng-lab/Llama-3.2-3B-gradiend-gender-debiased (model, 5 downloads)
- 🤗 aieng-lab/Llama-3.2-3B-Instruct-gradiend-gender-debiased (model, 5 downloads)
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Test
