Evaluation of Summarization Systems across Gender, Age, and Race

Anna Jørgensen; Anders Søgaard

arXiv:2110.04384·cs.CL·October 12, 2021

TL;DR

This paper investigates how demographic biases in human evaluators can influence the assessment of summarization systems, revealing that evaluation outcomes are sensitive to protected attributes like gender, age, and race.

Contribution

It highlights the impact of evaluator demographics on summarization system evaluation and emphasizes the need to consider protected attributes to ensure fair assessments.

Findings

01. Evaluation outcomes vary with evaluator demographics

02. Biases can lead to skewed system development

03. Demographic-sensitive evaluation affects system fairness

Abstract

Summarization systems are ultimately evaluated by human annotators and raters. Usually, annotators and raters do not reflect the demographics of end users, but are recruited through student populations or crowdsourcing platforms with skewed demographics. For two different evaluation scenarios -- evaluation against gold summaries and system output ratings -- we show that summary evaluation is sensitive to protected attributes. This can severely bias system development and evaluation, leading us to build models that cater for some groups rather than others.
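
The second scenario in the abstract, system output ratings, can be illustrated with a minimal sketch. It groups ratings by rater demographic and checks whether the groups prefer the same system. The group labels, system names, scores, and function names below are all hypothetical and invented for illustration; they are not the paper's data or method.

```python
from collections import defaultdict
from statistics import mean


def mean_rating_by_group(ratings):
    """Average the scores for each (rater group, system) pair.

    `ratings` is an iterable of (rater_group, system, score) tuples.
    Returns {rater_group: {system: mean_score}}.
    """
    buckets = defaultdict(list)
    for group, system, score in ratings:
        buckets[(group, system)].append(score)

    table = defaultdict(dict)
    for (group, system), scores in buckets.items():
        table[group][system] = mean(scores)
    return dict(table)


def preferred_system(table):
    """Return the highest-rated system for each rater group."""
    return {group: max(scores, key=scores.get) for group, scores in table.items()}


# Hypothetical toy ratings on a 1-5 scale (illustrative only, not real results).
toy_ratings = [
    ("group_a", "system_1", 4), ("group_a", "system_1", 5),
    ("group_a", "system_2", 3), ("group_a", "system_2", 2),
    ("group_b", "system_1", 2), ("group_b", "system_1", 3),
    ("group_b", "system_2", 4), ("group_b", "system_2", 4),
]

table = mean_rating_by_group(toy_ratings)
winners = preferred_system(table)

# If the groups disagree on the best system, pooling all raters
# (or sampling only one group) would hide that disagreement.
rankings_disagree = len(set(winners.values())) > 1
```

In a real study, a disagreement like this would also need a significance test and enough raters per group before drawing conclusions; the sketch only shows the bookkeeping.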

Taxonomy

Topics: Advanced Text Analysis Techniques · Topic Modeling · Natural Language Processing Techniques