How to Evaluate Word Representations of Informal Domain?

Yekun Chai; Naomi Saphra; Adam Lopez

arXiv:1911.04669·cs.CL·November 14, 2019

TL;DR

This paper introduces a new method for evaluating word embeddings in informal domains like Twitter by automatically extracting variant spelling pairs from UrbanDictionary, reducing the need for text normalization.

Contribution

It presents an automatic approach to derive variant spelling pairs for informal language, enabling more direct evaluation of word representations without normalization.

Findings

01 Large list of variant spelling pairs extracted from UrbanDictionary

02 Facilitates direct evaluation of non-standard word embeddings

03 Reduces reliance on traditional text normalization pipelines

Abstract

Diverse word representations have surged in most state-of-the-art natural language processing (NLP) applications. Nevertheless, how to efficiently evaluate such word embeddings in informal domains such as Twitter or forums remains an ongoing challenge due to the lack of sufficient evaluation datasets. We derived a large list of variant spelling pairs from UrbanDictionary using two automatic approaches: weakly-supervised pattern-based bootstrapping and a self-training linear-chain conditional random field (CRF). With these extracted relation pairs, we improve the odds of skipping the text normalization step of traditional NLP pipelines and directly adopting representations of non-standard words in the informal domain. Our code is available.
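
To make the idea of pattern-based bootstrapping concrete, the following is a minimal Python sketch under stated assumptions. It is not the authors' implementation: the seed patterns, the toy dictionary entries, and the function names (`extract_pairs`, `induce_patterns`, `bootstrap`) are all hypothetical and chosen only for illustration. The sketch alternates between two steps: matching definitions against known patterns to harvest (variant, standard) pairs, and using harvested pairs to discover new context patterns in other definitions of the same headword.

```python
import re
from collections import Counter

# Hypothetical seed patterns (not taken from the paper). Each captures the
# standard spelling mentioned in a dictionary definition.
SEED_PATTERNS = [r"(?:short form|misspelling) of ['\"]?(?P<std>[a-z]+)"]

# Toy (headword, definition) entries invented for this example.
ENTRIES = [
    ("tho", "Short form of though."),
    ("tho", "A lazy spelling of though used online."),
    ("2moro", "A lazy spelling of tomorrow."),
    ("luv", "Misspelling of love."),
]


def extract_pairs(entries, patterns):
    """Return (variant, standard) pairs whose definition matches a pattern."""
    pairs = set()
    for headword, definition in entries:
        for pattern in patterns:
            match = re.search(pattern, definition.lower())
            if match:
                pairs.add((headword.lower(), match.group("std")))
                break
    return pairs


def induce_patterns(entries, pairs, min_count=1):
    """Propose patterns from the two words preceding a known standard form."""
    known = dict(pairs)
    contexts = Counter()
    for headword, definition in entries:
        std = known.get(headword.lower())
        if std is None:
            continue
        regex = r"((?:[a-z]+ ){2})['\"]?" + re.escape(std) + r"\b"
        match = re.search(regex, definition.lower())
        if match:
            contexts[match.group(1)] += 1
    return [
        re.escape(context) + r"['\"]?(?P<std>[a-z]+)"
        for context, count in contexts.items()
        if count >= min_count
    ]


def bootstrap(entries, seed_patterns, rounds=2, min_count=1):
    """Alternate between pair extraction and pattern induction."""
    patterns = list(seed_patterns)
    pairs = set()
    for _ in range(rounds):
        pairs |= extract_pairs(entries, patterns)
        for pattern in induce_patterns(entries, pairs, min_count):
            if pattern not in patterns:
                patterns.append(pattern)
    return pairs


if __name__ == "__main__":
    print(sorted(bootstrap(ENTRIES, SEED_PATTERNS)))
```

In this toy run the seed patterns alone find only the pairs for "tho" and "luv"; the second definition of "tho" then yields a new "spelling of" context, which in turn recovers the pair for "2moro". A realistic system would likely score and filter induced patterns, since a permissive setting such as `min_count=1` admits noisy contexts.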

Code & Models

Repositories

cyk1337/UrbanDict
Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies