CodeSwitch-Reddit: Exploration of Written Multilingual Discourse in Online Discussion Forums
Ella Rabinovich, Masih Sultani, Suzanne Stevenson

TL;DR
This paper introduces a large dataset of written code-switched content from Reddit and uses it to examine whether sociolinguistic findings from spoken code-switching carry over to multilingual online discourse.
Contribution
The paper presents a novel, diverse dataset of written code-switching from Reddit, facilitating research on sociolinguistic patterns and speaker proficiency in online multilingual communication.
Findings
Content and style of written code-switching resemble spoken language patterns
Speaker proficiency influences code-switching behavior in online forums
Dataset enables new research avenues in multilingual online discourse
Abstract
In contrast to many decades of research on oral code-switching, the study of written multilingual productions has only recently enjoyed a surge of interest. Many open questions remain regarding the sociolinguistic underpinnings of written code-switching, and progress has been limited by a lack of suitable resources. We introduce a novel, large, and diverse dataset of written code-switched productions, curated from topical threads of multiple bilingual communities on the Reddit discussion platform, and explore questions that were mainly addressed in the context of spoken language thus far. We investigate whether findings in oral code-switching concerning content and style, as well as speaker proficiency, are carried over into written code-switching in discussion forums. The released dataset can further facilitate a range of research and practical activities.
Taxonomy
Topics: Digital Communication and Language · Multilingual Education and Policy · Hate Speech and Cyberbullying Detection
