Exploring Word Embeddings for Unsupervised Textual User-Generated Content Normalization

Thales Felipe Costa Bertaglia; Maria das Graças Volpe Nunes

arXiv:1704.02963 · cs.CL · April 11, 2017 · 35 cites

TL;DR

This paper introduces an unsupervised, language-independent method using word embeddings to normalize user-generated content, effectively correcting errors and slang in Brazilian Portuguese reviews.

Contribution

It presents a novel unsupervised approach that leverages word embeddings for content normalization and outperforms existing tools for Brazilian Portuguese.

Findings

01 High correction rate of orthographic errors and slang

02 Outperforms current tools for Brazilian Portuguese

03 Method is language and domain independent

Abstract

Text normalization techniques based on rules, lexicons, or supervised training that requires large corpora are neither scalable nor domain-interchangeable, which makes them unsuitable for normalizing user-generated content (UGC). Current tools available for Brazilian Portuguese rely on such techniques. In this work we propose a technique based on distributed representations of words (word embeddings), which represent words as continuous numeric vectors of high dimensionality. The vectors explicitly encode many linguistic regularities and patterns, as well as syntactic and semantic word relationships. Words that share semantic similarity are represented by similar vectors. Based on these features, we present a totally unsupervised, expandable, and language- and domain-independent method for learning normalization lexicons from word embeddings. Our approach obtains high correction…
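The idea of learning a normalization lexicon from embedding similarity can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 3-dimensional vectors, the `CANONICAL` word set, the `weight` and `threshold` parameters, and the use of `difflib.SequenceMatcher` as a lexical-similarity term are all assumptions made for illustration. In practice the vectors would come from embeddings trained on a large UGC corpus.

```python
import math
from difflib import SequenceMatcher

# Hypothetical toy embeddings; real ones would be trained on a UGC corpus.
EMBEDDINGS = {
    "você":  [0.90, 0.10, 0.05],
    "vc":    [0.88, 0.12, 0.07],
    "voce":  [0.91, 0.09, 0.04],
    "muito": [0.10, 0.95, 0.05],
    "mto":   [0.12, 0.93, 0.06],
    "bom":   [0.05, 0.10, 0.95],
}

# Assumed set of well-formed words (e.g. taken from a dictionary).
CANONICAL = {"você", "muito", "bom"}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_lexicon(embeddings, canonical, weight=0.8, threshold=0.7):
    """Map each non-canonical word to its best-scoring canonical neighbour.

    The score mixes embedding similarity with string similarity; the mixing
    weight and acceptance threshold are illustrative assumptions.
    """
    lexicon = {}
    for word, vec in embeddings.items():
        if word in canonical:
            continue
        best, best_score = None, threshold
        for cand in canonical:
            semantic = cosine(vec, embeddings[cand])
            lexical = SequenceMatcher(None, word, cand).ratio()
            score = weight * semantic + (1 - weight) * lexical
            if score > best_score:
                best, best_score = cand, score
        if best is not None:
            lexicon[word] = best
    return lexicon


def normalize(text, lexicon):
    """Replace every token found in the lexicon with its canonical form."""
    return " ".join(lexicon.get(tok, tok) for tok in text.lower().split())


lexicon = build_lexicon(EMBEDDINGS, CANONICAL)
normalized = normalize("vc é mto bom", lexicon)
```

With these toy vectors, `lexicon` maps "vc" and "voce" to "você" and "mto" to "muito", so `normalized` is "você é muito bom". The lexicon is built once, without labeled data, and can be extended simply by retraining the embeddings on more text.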

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification