A Tidy Data Model for Natural Language Processing using cleanNLP

Taylor Arnold

arXiv:1703.09570 · cs.CL · May 4, 2018

TL;DR

The paper introduces cleanNLP, a fast R package that converts textual data into normalized tables using Stanford's CoreNLP, supporting multiple languages and various NLP annotation tasks.

Contribution

It provides a unified, efficient data model for NLP tasks in R, integrating multiple annotation tools into a single pipeline for multilingual text processing.

Findings

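01 Supports text written in English, French, German, and Spanish.

02 Annotators include tokenization, part of speech tagging, named entity recognition, entity linking, sentiment analysis, dependency parsing, coreference resolution, and information extraction.

03 Returns annotations as normalized tables, which streamlines NLP data analysis in R.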

Abstract

The package cleanNLP provides a set of fast tools for converting a textual corpus into a set of normalized tables. The underlying natural language processing pipeline utilizes Stanford's CoreNLP library, exposing a number of annotation tasks for text written in English, French, German, and Spanish. Annotators include tokenization, part of speech tagging, named entity recognition, entity linking, sentiment analysis, dependency parsing, coreference resolution, and information extraction.
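
To make the idea of "normalized tables" concrete, the following is a minimal sketch in Python. It is not cleanNLP's API or its actual schema. The function names (`tokenize_corpus`, `tokens_per_document`), the column names (`doc_id`, `sid`, `tid`, `word`), and the regex-based sentence splitting and tokenization are all assumptions made for illustration, standing in for the CoreNLP pipeline. The sketch shows one row per document and one row per token, linked by a shared document key.

```python
import re
from collections import Counter


def tokenize_corpus(docs):
    """Return (document_table, token_table) as lists of row dicts.

    Hypothetical schema for illustration only, not cleanNLP's own:
      document_table: one row per document (doc_id, n_chars)
      token_table:    one row per token (doc_id, sid, tid, word)
    """
    document_table = []
    token_table = []
    for doc_id, text in enumerate(docs, start=1):
        document_table.append({"doc_id": doc_id, "n_chars": len(text)})
        # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        for sid, sentence in enumerate(sentences, start=1):
            # Naive tokenization: runs of word characters, or single punctuation marks.
            words = re.findall(r"\w+|[^\w\s]", sentence)
            for tid, word in enumerate(words, start=1):
                token_table.append(
                    {"doc_id": doc_id, "sid": sid, "tid": tid, "word": word}
                )
    return document_table, token_table


def tokens_per_document(token_table):
    """Aggregate the token table by its document key."""
    return Counter(row["doc_id"] for row in token_table)


if __name__ == "__main__":
    docs = ["The cat sat. It purred!", "Hello world."]
    document_table, token_table = tokenize_corpus(docs)
    for row in token_table:
        print(row)
    print(tokens_per_document(token_table))
```

Because every token row carries the keys `doc_id`, `sid`, and `tid`, further annotation layers (for example, entities or dependencies) could live in their own tables and be joined back on the same keys, which is the general appeal of a normalized layout.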

Code & Models

Repositories

statsmaths/cleanNLP (official)
