Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation
A. Seza Doğruöz, Sunayana Sitaram, Zheng-Xin Yong

TL;DR
This paper critically examines 68 code-switching (CSW) datasets, highlights the neglect of representativeness in data collection and preparation, and offers guidelines to improve the quality and diversity of future multilingual and CSW datasets.
Contribution
The paper analyzes existing CSW datasets in depth, identifies key flaws in data collection and preparation, and proposes a checklist to improve dataset representativeness for multilingual system development.
Findings
Most CSW data involves English, neglecting other language pairs.
Data collection and preparation overlook location-based, socio-demographic, and register variation in CSW.
Lack of clarity in data filtering affects dataset representativeness.
Abstract
Multilingualism is widespread around the world and code-switching (CSW) is a common practice among different language pairs/tuples across locations and regions. However, there is still not much progress in building successful CSW systems, despite the recent advances in Massive Multilingual Language Models (MMLMs). We investigate the reasons behind this setback through a critical study about the existing CSW data sets (68) across language pairs in terms of the collection and preparation (e.g. transcription and annotation) stages. This in-depth analysis reveals that a) most CSW data involves English, ignoring other language pairs/tuples; b) there are flaws in terms of representativeness in data collection and preparation stages due to ignoring the location-based, socio-demographic and register variation in CSW. In addition, lack of clarity on the data selection and…
Taxonomy
Topics: Multilingual Education and Policy · Second Language Learning and Teaching · Social Media and Politics
