Lexical Normalisation of Twitter Data

Bilal Ahmed

arXiv:1409.4614·cs.CL·September 22, 2015

TL;DR

This paper explores techniques for lexical normalisation of Twitter data to address challenges posed by informal language, misspellings, and abbreviations, aiming to improve NLP tool performance on social media text.

Contribution

It introduces and evaluates various lexical normalisation methods specifically tailored for Twitter's informal language and abbreviations, enhancing NLP processing accuracy.

Findings

01. Certain normalisation techniques significantly improve NLP accuracy on Twitter data.

02. Lexical normalisation reduces spelling and grammatical errors in Twitter messages.

03. Processed data shows better compatibility with standard NLP tools.

Abstract

Twitter, with over 500 million users globally, generates over 100,000 tweets per minute. The 140-character limit per tweet, perhaps unintentionally, encourages users to use shorthand notations and to strip spellings to their bare minimum "syllables" or elisions, e.g. "srsly". The analysis of Twitter messages, which typically contain misspellings, elisions, and grammatical errors, poses a challenge to established Natural Language Processing (NLP) tools, which are generally designed on the assumption that the data conforms to the basic grammatical structure commonly used in the English language. To make sense of Twitter messages, it is necessary to first transform them into a canonical form, consistent with the dictionary or grammar. This process, performed at the level of individual tokens ("words"), is called lexical normalisation. This paper investigates various techniques for…
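To make the idea of token-level normalisation concrete, here is a minimal Python sketch. It is not the paper's method (the abstract is truncated before the techniques are named); it only assumes a simple pipeline of a slang lookup followed by approximate string matching against a dictionary. The `SLANG` table, the `VOCAB` list, and the function names are hypothetical toy resources invented for illustration.

```python
import difflib

# Hypothetical toy resources for illustration only; not taken from the paper.
SLANG = {"srsly": "seriously", "u": "you", "gr8": "great", "2moro": "tomorrow"}
VOCAB = ["seriously", "you", "are", "great", "tomorrow", "see", "really", "good"]


def normalise_token(token: str) -> str:
    """Map one token to a canonical dictionary form, or return it unchanged."""
    lower = token.lower()
    # Leave Twitter-specific tokens (mentions, hashtags, links) untouched.
    if lower.startswith(("@", "#", "http")):
        return token
    # Already a dictionary word: nothing to do beyond lowercasing.
    if lower in VOCAB:
        return lower
    # Known shorthand: direct lookup.
    if lower in SLANG:
        return SLANG[lower]
    # Otherwise fall back to the closest dictionary word, if it is close enough.
    matches = difflib.get_close_matches(lower, VOCAB, n=1, cutoff=0.8)
    return matches[0] if matches else token


def normalise(tweet: str) -> str:
    """Normalise a tweet token by token, splitting on whitespace."""
    return " ".join(normalise_token(t) for t in tweet.split())


if __name__ == "__main__":
    print(normalise("srsly u are gr8"))
    print(normalise("@bob see u 2moro"))
```

The ordering (exact dictionary match, then lookup table, then fuzzy match) is one plausible design choice: it keeps well-formed words from being "corrected" by the fuzzy step, and the `cutoff` value of 0.8 is an arbitrary threshold that would need tuning on real data.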

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling