An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection

Anurag Roy; Shalmoli Ghosh; Kripabandhu Ghosh; Saptarshi Ghosh

arXiv:2101.03303 · cs.IR · January 12, 2021

TL;DR

This paper introduces an unsupervised, language-independent text normalization algorithm that effectively cleans noisy text, improving retrieval and stance detection without requiring training data or human intervention.

Contribution

The authors present a novel unsupervised normalization method applicable across languages, handling various noise types without supervised resources.

Findings

01 Improves retrieval accuracy over baseline methods
02 Enhances stance detection performance
03 Works effectively on multiple languages and noise types

Abstract

A large fraction of textual data available today contains various types of 'noise', such as OCR noise in digitized documents, noise due to informal writing style of users on microblogging sites, and so on. To enable tasks such as search/retrieval and classification over all the available data, we need robust algorithms for text normalization, i.e., for cleaning different kinds of noise in the text. There have been several efforts towards cleaning or normalizing noisy text; however, many of the existing text normalization methods are supervised and require language-dependent resources or large amounts of training data that is difficult to obtain. We propose an unsupervised algorithm for text normalization that does not need any training data / human intervention. The proposed algorithm is applicable to text over different languages, and can handle both machine-generated and…

Code & Models

Repositories

ranarag/UnsupClean (Official)
