Creating a Live, Public Short Message Service Corpus: The NUS SMS Corpus
Tao Chen, Min-Yen Kan

TL;DR
This paper details the creation of a publicly accessible SMS corpus, collected with attention to contributors' privacy, so that researchers can run comparative studies on the same SMS data in multiple languages.
Contribution
It introduces a methodology for collecting and releasing a large, privacy-conscious, multilingual SMS corpus with ongoing updates and metadata for diverse analyses.
Findings
Collected about 60,000 messages to date
Corpus includes English and Mandarin Chinese
Provides monthly updates and detailed metadata
Abstract
Short Message Service (SMS) messages are largely sent directly from one person to another from their mobile phones. They represent a means of personal communication that is an important communicative artifact in our current digital era. As most existing studies have used private access to SMS corpora, comparative studies using the same raw SMS data have not been possible up to now. We describe our efforts to collect a public SMS corpus to address this problem. We use a battery of methodologies to collect the corpus, paying particular attention to privacy issues to address contributors' concerns. Our live project collects new SMS message submissions, checks their quality and adds the valid messages, releasing the resultant corpus as XML and as SQL dumps, along with corpus statistics, every month. We opportunistically collect as much metadata about the messages and their sender as…
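The release cycle described in the abstract (collect submissions, check quality, keep valid messages, publish XML and SQL dumps with statistics) could be sketched roughly as follows. This is only an illustrative sketch: the `is_valid` rule, the `build_release` function, the XML element names, and the `message` table layout are all assumptions made for the example and are not the schema or checks used by the actual corpus.

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import Counter


def is_valid(text):
    # Hypothetical quality check: keep only non-empty messages.
    # The real project's validation rules are not given in the abstract.
    return bool(text and text.strip())


def build_release(submissions):
    """Build an XML string, an SQL dump, and basic statistics.

    `submissions` is a list of dicts with assumed keys "text" and "lang".
    """
    valid = [s for s in submissions if is_valid(s["text"])]
    rows = [(i, s["lang"], s["text"].strip()) for i, s in enumerate(valid, start=1)]

    # XML export (element and attribute names are illustrative).
    root = ET.Element("smsCorpus")
    for msg_id, lang, text in rows:
        msg = ET.SubElement(root, "message", id=str(msg_id), lang=lang)
        ET.SubElement(msg, "text").text = text
    xml_str = ET.tostring(root, encoding="unicode")

    # SQL export via an in-memory SQLite database (table layout is illustrative).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE message (id INTEGER PRIMARY KEY, lang TEXT, text TEXT)")
    conn.executemany("INSERT INTO message (id, lang, text) VALUES (?, ?, ?)", rows)
    conn.commit()
    sql_dump = "\n".join(conn.iterdump())
    conn.close()

    # Corpus statistics released alongside the dumps.
    stats = {"total": len(rows), "by_lang": dict(Counter(lang for _, lang, _ in rows))}
    return xml_str, sql_dump, stats
```

Calling `build_release` on a batch of new submissions once a month would yield the three artifacts the abstract mentions; a real pipeline would also need the anonymization and metadata handling that the paper emphasizes, which this sketch omits.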
