Socially Responsible Data for Large Multilingual Language Models
Andrew Smart, Ben Hutchinson, Lameck Mbangula Amugongo, Suzanne Dikker, Alex Zito, Amber Ebinama, Zara Wudiri, Ding Wang, Erin van Liemt, João Sedoc, Seyi Olojo, Stanley Uwakwe, Edem Wornyo, Sonja Schmer-Galunder, Jamila Smith-Loud

TL;DR
This paper discusses the ethical, social, and cultural challenges of collecting data for multilingual large language models, emphasizing community involvement and proposing guidelines for responsible data practices.
Contribution
It offers twelve recommendations for ethically and culturally responsible data collection for underrepresented languages in LLMs, based on recent scholarship and community engagement.
Findings
Highlighting ethical concerns in data collection for underrepresented languages
Proposing community-centered approaches and participatory design
Providing guidelines to mitigate exploitation and cultural insensitivity
Abstract
Large Language Models (LLMs) have rapidly increased in size and apparent capabilities in the last three years, but their training data is largely English text. There is growing interest in multilingual LLMs, and various efforts are striving for models to accommodate languages of communities outside of the Global North, which include many languages that have been historically underrepresented in digital realms. These languages have been termed "low resource languages" or "long-tail languages", and LLMs' performance on these languages is generally poor. While expanding the use of LLMs to more languages may bring many potential benefits, such as assisting cross-community communication and language preservation, great care must be taken to ensure that data collection on these languages is not extractive and that it does not reproduce exploitative practices of the past. Collecting data…
Taxonomy
Topics: Natural Language Processing Techniques
