Test Suites Task: Evaluation of Gender Fairness in MT with MuST-SHE and INES
Beatrice Savoldi, Marco Gaido, Matteo Negri, Luisa Bentivogli

TL;DR
This paper evaluates the ability of machine translation systems to accurately translate gendered language and produce gender-inclusive translations using two new test suites, highlighting current strengths and challenges in gender fairness.
Contribution
It introduces and assesses two novel test suites, MuST-SHE-WMT23 and INES, for evaluating gender fairness in machine translation, focusing on en-de and de-en pairs.
Findings
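Systems achieve reasonable and comparable accuracy when translating feminine and masculine gender forms in naturalistic gender phenomena.
Generating gender-inclusive language remains challenging for all evaluated MT models.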
Human evaluations validate the metrics used in the test suites.
Abstract
As part of the WMT-2023 "Test suites" shared task, in this paper we summarize the results of two test suite evaluations: MuST-SHE-WMT23 and INES. Focusing on the en-de and de-en language pairs, we rely on these newly created test suites to investigate systems' ability to translate feminine and masculine gender and to produce gender-inclusive translations. Furthermore, we discuss the metrics associated with our test suites and validate them by means of human evaluations. Our results indicate that systems achieve reasonable and comparable performance in correctly translating both feminine and masculine gender forms for naturalistic gender phenomena. By contrast, the generation of inclusive language forms in translation emerges as a challenging task for all the evaluated MT models, indicating room for future improvements and research on the topic.
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Hate Speech and Cyberbullying Detection
