# A Tidy Data Model for Natural Language Processing using cleanNLP

**Authors:** Taylor Arnold

arXiv: 1703.09570 · 2018-05-04

## TL;DR

The paper introduces cleanNLP, an R package that provides fast tools for converting a text corpus into a set of normalized tables. Its pipeline is built on Stanford's CoreNLP and exposes annotation tasks for text in English, French, German, and Spanish.

## Contribution

It provides a unified data model for NLP in R: a single pipeline, built on Stanford's CoreNLP, runs multiple annotation tasks on text in four languages and stores the results as normalized tables.

## Key findings

- Supports English, French, German, and Spanish.
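- Annotators cover tokenization, part-of-speech tagging, named entity recognition, entity linking, sentiment analysis, dependency parsing, coreference resolution, and information extraction.
- Returns annotations as a set of normalized tables for analysis in R.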

## Abstract

The package cleanNLP provides a set of fast tools for converting a textual corpus into a set of normalized tables. The underlying natural language processing pipeline utilizes Stanford's CoreNLP library, exposing a number of annotation tasks for text written in English, French, German, and Spanish. Annotators include tokenization, part of speech tagging, named entity recognition, entity linking, sentiment analysis, dependency parsing, coreference resolution, and information extraction.
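To make "a set of normalized tables" concrete, here is a minimal sketch of the idea in plain Python. It is not cleanNLP's API and does not call CoreNLP: the function `annotate`, the column names `doc_id`, `sid`, `tid`, `word`, and `n_chars`, and the regex-based splitting are all assumptions chosen for illustration. The point is only that each annotation level gets its own table, and rows are linked by shared keys rather than nested inside one another.

```python
import re
from collections import Counter


def annotate(corpus):
    """Turn a {doc_id: text} corpus into two normalized tables (lists of row dicts).

    Hypothetical helper for illustration; a real pipeline would use trained
    models for sentence splitting and tokenization.
    """
    documents, tokens = [], []
    for doc_id, text in corpus.items():
        # One row per document.
        documents.append({"doc_id": doc_id, "n_chars": len(text)})
        # Naive sentence split: break on whitespace that follows ., ! or ?
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        for sid, sentence in enumerate(sentences, start=1):
            # Naive tokenization: runs of word characters, or single punctuation marks.
            words = re.findall(r"\w+|[^\w\s]", sentence)
            for tid, word in enumerate(words, start=1):
                # One row per token, keyed by (doc_id, sid, tid).
                tokens.append({"doc_id": doc_id, "sid": sid, "tid": tid, "word": word})
    return documents, tokens


corpus = {"d1": "The cat sat. It purred!", "d2": "Dogs bark."}
documents, tokens = annotate(corpus)

# Because every token row carries its doc_id, a per-document summary is a
# simple group-by over one table.
tokens_per_doc = Counter(row["doc_id"] for row in tokens)
print(tokens_per_doc)
```

Further annotation levels (for example entities or dependencies) would, under the same assumption, become additional tables that reuse the `doc_id`, `sid`, and `tid` keys, so they can be joined to the token table when needed.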

---
Source: https://tomesphere.com/paper/1703.09570