On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis

Jose Camacho-Collados; Mohammad Taher Pilehvar

arXiv:1707.01780·cs.CL·August 24, 2018

TL;DR

This study evaluates how basic text preprocessing steps affect neural network performance in text categorization and sentiment analysis, emphasizing the importance of preprocessing choices for model accuracy and embedding training.

Contribution

It provides an extensive evaluation of simple preprocessing techniques, highlighting their impact and variability in neural text classification tasks.

Findings

1. Tokenization generally suffices for good performance
2. Preprocessing choices significantly affect results
3. Insights into optimal preprocessing for word embeddings

Abstract

Text preprocessing is often the first step in the pipeline of a Natural Language Processing (NLP) system, with a potential impact on its final performance. Despite its importance, text preprocessing has not received much attention in the deep learning literature. In this paper we investigate the impact of simple text preprocessing decisions (particularly tokenizing, lemmatizing, lowercasing and multiword grouping) on the performance of a standard neural text classifier. We perform an extensive evaluation on standard benchmarks from text categorization and sentiment analysis. While our experiments show that a simple tokenization of input text is generally adequate, they also highlight significant degrees of variability across preprocessing techniques. This reveals the importance of paying attention to this usually-overlooked step in the pipeline, particularly when comparing different…
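To make the four preprocessing decisions named in the abstract concrete, here is a minimal Python sketch of what each one does to a sentence. It is an illustration only and not the authors' pipeline: this page does not say which tools they used, so the regex tokenizer, the `TOY_LEMMAS` dictionary, and the `TOY_MULTIWORDS` lexicon are all hypothetical stand-ins for a real tokenizer, lemmatizer, and multiword inventory.

```python
import re

# Hypothetical toy resources, for illustration only (not taken from the paper).
TOY_LEMMAS = {"was": "be", "is": "be", "movies": "movie", "watched": "watch"}
TOY_MULTIWORDS = {("new", "york"), ("united", "states")}


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def lowercase(tokens):
    """Map every token to lower case."""
    return [t.lower() for t in tokens]


def lemmatize(tokens):
    """Replace a token with its lemma when the toy dictionary knows it."""
    return [TOY_LEMMAS.get(t.lower(), t) for t in tokens]


def group_multiwords(tokens, max_len=3):
    """Greedily join the longest known multiword expression into one token."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in TOY_MULTIWORDS:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword expression starts here; keep the single token.
            out.append(tokens[i])
            i += 1
    return out


tokens = tokenize("I watched two movies in New York.")
print(tokens)
print(lowercase(tokens))
print(lemmatize(tokens))
print(group_multiwords(tokens))
```

Each function takes and returns a list of tokens, so the steps can be applied alone or chained, which mirrors the idea of comparing preprocessing variants on the same classifier input.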
