uniblock: Scoring and Filtering Corpus with Unicode Block Information

Yingbo Gao; Weiyue Wang; Hermann Ney

arXiv:1908.09716 · cs.CL · August 27, 2019

TL;DR

Uniblock is a statistical method that uses Unicode block information and Gaussian mixture models to score and filter sentences in NLP preprocessing, reducing the need for manual rule scripting.

Contribution

It introduces a simple, effective Unicode-based feature vector and a Gaussian mixture model for automatic sentence filtering across multiple NLP tasks.

Findings

01 Improves sentence filtering accuracy in NLP tasks

02 Reduces manual rule scripting in preprocessing

03 Demonstrates effectiveness across sentiment analysis, language modeling, and translation

Abstract

Preprocessing pipelines in Natural Language Processing usually include a step that removes sentences consisting of illegal characters. The definition of illegal characters and the specific removal strategy depend on the task, language, domain, etc., which often leads to tiresome and repetitive scripting of rules. In this paper, we introduce a simple statistical method, uniblock, to overcome this problem. For each sentence, uniblock generates a fixed-size feature vector using the Unicode block information of its characters. A Gaussian mixture model is then estimated on a clean corpus using variational inference. The learned model can then be used to score sentences and filter a corpus. We present experimental results on Sentiment Analysis, Language Modeling, and Machine Translation, and show the simplicity and effectiveness of our method.
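
The feature-extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the block table below is a small hand-picked subset of the real Unicode block list, the catch-all "other" bucket and the normalization of counts to fractions are assumptions, and the names `BLOCKS`, `block_index`, and `block_features` are hypothetical. The Gaussian mixture model and its variational estimation are not shown.

```python
# Illustrative subset of Unicode blocks: (name, first code point, last code point).
# Assumption: a real system would load the full block list instead.
BLOCKS = [
    ("Basic Latin", 0x0000, 0x007F),
    ("Latin-1 Supplement", 0x0080, 0x00FF),
    ("Cyrillic", 0x0400, 0x04FF),
    ("Hiragana", 0x3040, 0x309F),
    ("Katakana", 0x30A0, 0x30FF),
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
]


def block_index(ch):
    """Return the index of the block containing ch, or len(BLOCKS) for 'other'."""
    cp = ord(ch)
    for i, (_, first, last) in enumerate(BLOCKS):
        if first <= cp <= last:
            return i
    return len(BLOCKS)  # catch-all bucket (an assumption of this sketch)


def block_features(sentence):
    """Map a sentence to a fixed-size vector of per-block character fractions."""
    counts = [0] * (len(BLOCKS) + 1)
    for ch in sentence:
        counts[block_index(ch)] += 1
    total = sum(counts)
    if total == 0:
        # Empty sentence: return an all-zero vector rather than dividing by zero.
        return [0.0] * len(counts)
    return [c / total for c in counts]
```

Every sentence maps to a vector of the same length regardless of how long it is, which is what allows a single density model to be fitted on a clean corpus and then used to score unseen sentences. A sentence mixing Latin and Cyrillic characters, for example, spreads its mass across two entries, while a purely Latin sentence concentrates it in one.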

Code & Models

Repositories

ringoreality/uniblock (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis