Gender Representation in Open Source Speech Resources
Mahault Garnerin, Solange Rossato, Laurent Besacier

TL;DR
This paper investigates gender representation in open-source speech datasets, shows that speaker gender information is often hard to locate in these corpora, and recommends metadata practices that improve the transparency and fairness of systems built on them.
Contribution
It provides an analysis of gender balance in open speech resources and proposes guidelines for metadata to enhance transparency and fairness in AI speech systems.
Findings
Gender information is difficult to find in open corpora.
Gender balance varies with corpus characteristics.
The paper recommends metadata practices to improve transparency.
Abstract
With the rise of artificial intelligence (AI) and the growing use of deep-learning architectures, the question of ethics, transparency and fairness of AI systems has become a central concern within the research community. We address transparency and fairness in spoken language systems by proposing a study about gender representation in speech resources available through the Open Speech and Language Resource platform. We show that finding gender information in open-source corpora is not straightforward and that gender balance depends on other corpus characteristics (elicited/non-elicited speech, low/high resource language, speech task targeted). The paper ends with recommendations about metadata and gender information for researchers in order to ensure better transparency of the speech systems built using such corpora.
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
