Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus
Svetlana Churina, Akshat Gupta, Insyirah Mujtahid, Kokil Jaidka

TL;DR
This paper introduces the NUS ABC Codemixed Corpus, a large, annotated dataset of multilingual chat messages capturing code-mixing patterns, aimed at advancing research in computational linguistics and NLP.
Contribution
It presents the first publicly available, author-labeled corpus of code-mixed chat messages with detailed metadata, enabling better modeling of multilingual conversations.
Findings
Contains over 355,641 messages with diverse code-mixing patterns
Focuses on English, Mandarin, and other languages
Provides structured dataset in JSON format for research use
Abstract
Code-mixing involves the seamless integration of linguistic elements from multiple languages within a single discourse, reflecting natural multilingual communication patterns. Despite its prominence in informal interactions such as social media, chat messages and instant-messaging exchanges, there has been a lack of publicly available corpora that are author-labeled and suitable for modeling human conversations and relationships. This study introduces the first labeled and general-purpose corpus for understanding code-mixing in context while maintaining rigorous privacy and ethical standards. Our live project will continuously gather, verify, and integrate code-mixed messages into a structured dataset released in JSON format, accompanied by detailed metadata and linguistic statistics. To date, it includes over 355,641 messages spanning various code-mixing patterns, with a primary focus…
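As an illustration of how a JSON release of chat messages might be scanned for code-mixing, here is a minimal sketch. The record layout (the `id` and `text` fields) and the sample messages are assumptions made for this example, not the corpus's documented schema. The helpers `scripts_in` and `is_script_mixed` are hypothetical names, and the script check is a crude stand-in rather than the authors' labeling method.

```python
import json

# Hypothetical records: the field names "id" and "text" and the messages
# themselves are assumptions for illustration, not the corpus schema.
SAMPLE_JSON = """
[
  {"id": 1, "text": "ok see you later"},
  {"id": 2, "text": "I am so 累 today"},
  {"id": 3, "text": "今天好忙"}
]
"""


def scripts_in(text):
    """Return the coarse scripts ('latin', 'han') that occur in text."""
    found = set()
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
            found.add("han")
        elif ch.isascii() and ch.isalpha():
            found.add("latin")
    return found


def is_script_mixed(text):
    """True when a message contains both Latin letters and Han characters."""
    return len(scripts_in(text)) > 1


messages = json.loads(SAMPLE_JSON)
mixed_ids = [m["id"] for m in messages if is_script_mixed(m["text"])]
share = len(mixed_ids) / len(messages)
print(f"script-mixed messages: {mixed_ids} ({share:.0%} of sample)")
```

A script-based check like this only catches mixing across writing systems. It would miss romanized code-mixing, where both languages are written in Latin letters, which is one reason author-provided labels are useful.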
Taxonomy
Topics: Text Readability and Simplification · Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques
