Creating a Live, Public Short Message Service Corpus: The NUS SMS Corpus

Tao Chen; Min-Yen Kan

arXiv:1112.2468 · cs.CL · September 4, 2012

TL;DR

This paper details the creation of a publicly accessible SMS corpus, collected with privacy considerations, enabling consistent comparative research on SMS messages in multiple languages.

Contribution

It introduces a methodology for collecting and releasing a large, privacy-conscious, multilingual SMS corpus with ongoing updates and metadata for diverse analyses.

Findings

01. Collected about 60,000 messages to date

02. Corpus includes English and Mandarin Chinese

03. Provides monthly updates and detailed metadata

Abstract

Short Message Service (SMS) messages are largely sent directly from one person to another from their mobile phones. They represent a means of personal communication that is an important communicative artifact in our current digital era. As most existing studies have used private access to SMS corpora, comparative studies using the same raw SMS data have not been possible up to now. We describe our efforts to collect a public SMS corpus to address this problem. We use a battery of methodologies to collect the corpus, paying particular attention to privacy issues to address contributors' concerns. Our live project collects new SMS message submissions, checks their quality and adds the valid messages, releasing the resultant corpus as XML and as SQL dumps, along with corpus statistics, every month. We opportunistically collect as much metadata about the messages and their sender as…
