uniblock: Scoring and Filtering Corpus with Unicode Block Information
Yingbo Gao, Weiyue Wang, Hermann Ney

TL;DR
Uniblock is a statistical method that uses Unicode block information and Gaussian mixture models to score and filter sentences in NLP preprocessing, reducing the need for manual rule scripting.
Contribution
It introduces a simple, effective Unicode-based feature vector and a Gaussian mixture model for automatic sentence filtering across multiple NLP tasks.
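The feature vector can be illustrated with a minimal sketch. The summary does not give the exact construction, so several things here are assumptions: the `BLOCKS` table is a small hand-picked subset of real Unicode block ranges, the catch-all `OTHER` slot is added for illustration, and normalizing counts by sentence length is a guess at how a fixed-size vector might be made comparable across sentences. The names `block_index` and `block_features` are hypothetical.

```python
from collections import Counter

# Illustrative subset of Unicode blocks as (name, first, last) code points.
# The Unicode standard defines several hundred blocks; this table is an
# assumption made to keep the sketch short.
BLOCKS = [
    ("Basic Latin", 0x0000, 0x007F),
    ("Latin-1 Supplement", 0x0080, 0x00FF),
    ("Cyrillic", 0x0400, 0x04FF),
    ("Hiragana", 0x3040, 0x309F),
    ("Katakana", 0x30A0, 0x30FF),
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
]
OTHER = len(BLOCKS)  # catch-all index for code points outside the subset


def block_index(ch):
    """Return the index of the block containing ch, or OTHER."""
    cp = ord(ch)
    for i, (_, first, last) in enumerate(BLOCKS):
        if first <= cp <= last:
            return i
    return OTHER


def block_features(sentence):
    """Return the fraction of characters that fall in each block.

    The result always has len(BLOCKS) + 1 entries, regardless of
    sentence length. Normalizing by length is an assumption.
    """
    size = len(BLOCKS) + 1
    if not sentence:
        return [0.0] * size
    counts = Counter(block_index(ch) for ch in sentence)
    return [counts[i] / len(sentence) for i in range(size)]
```

With this table, a purely ASCII sentence puts all of its mass in the first slot, while a sentence mixing Latin and Cyrillic letters splits its mass between the first and third slots.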
Findings
Improves sentence filtering accuracy in NLP tasks
Reduces manual rule scripting in preprocessing
Demonstrates effectiveness across sentiment analysis, language modeling, and translation
Abstract
Preprocessing pipelines in Natural Language Processing usually include a step that removes sentences consisting of illegal characters. The definition of illegal characters and the specific removal strategy depend on the task, language, domain, etc., which often leads to tiresome and repetitive scripting of rules. In this paper, we introduce a simple statistical method, uniblock, to overcome this problem. For each sentence, uniblock generates a fixed-size feature vector using Unicode block information of the characters. A Gaussian mixture model is then estimated on a clean corpus using variational inference. The learned model can then be used to score sentences and filter a corpus. We present experimental results on Sentiment Analysis, Language Modeling and Machine Translation, and show the simplicity and effectiveness of our method.
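The scoring and filtering step described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the mixture parameters are set by hand rather than estimated by variational inference, the covariances are assumed diagonal, and keeping sentences whose log-score meets a fixed threshold is one plausible filtering rule that the abstract does not spell out. The function names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log-density of x under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_score(x, weights, means, variances):
    """Return log sum_k w_k N(x | mean_k, diag(var_k)) via log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def filter_by_score(vectors, threshold, weights, means, variances):
    """Keep feature vectors whose log-score is at least threshold.

    The fixed-threshold rule is an assumption made for illustration.
    """
    return [
        x for x in vectors
        if gmm_log_score(x, weights, means, variances) >= threshold
    ]


# Hand-set toy parameters (assumed, not learned): two components over a
# 2-dimensional feature vector [fraction Latin, fraction other].
WEIGHTS = [0.7, 0.3]
MEANS = [[1.0, 0.0], [0.9, 0.1]]
VARIANCES = [[0.01, 0.01], [0.01, 0.01]]

clean = [1.0, 0.0]   # all characters in the expected block
noisy = [0.2, 0.8]   # mostly characters from unexpected blocks
kept = filter_by_score([clean, noisy], 0.0, WEIGHTS, MEANS, VARIANCES)
```

With these toy parameters the clean vector scores well above the noisy one, so only the clean vector survives the threshold. In practice the parameters would come from a model fit on a clean corpus; a library implementation of variational mixtures, for example scikit-learn's `BayesianGaussianMixture`, could supply them.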
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
