Evaluation of Summarization Systems across Gender, Age, and Race
Anna Jørgensen, Anders Søgaard

TL;DR
This paper investigates how the demographics of human evaluators influence the assessment of summarization systems, and shows that evaluation outcomes are sensitive to protected attributes such as gender, age, and race.
Contribution
It highlights the impact of evaluator demographics on summarization system evaluation and emphasizes the need to consider protected attributes to ensure fair assessments.
Findings
Evaluation outcomes vary with evaluator demographics
Biases can lead to skewed system development
Demographic-sensitive evaluation affects system fairness
Abstract
Summarization systems are ultimately evaluated by human annotators and raters. Usually, annotators and raters do not reflect the demographics of end users, but are recruited through student populations or crowdsourcing platforms with skewed demographics. For two different evaluation scenarios -- evaluation against gold summaries and system output ratings -- we show that summary evaluation is sensitive to protected attributes. This can severely bias system development and evaluation, leading us to build models that cater for some groups rather than others.
Taxonomy
Topics: Advanced Text Analysis Techniques · Topic Modeling · Natural Language Processing Techniques
