On consistency scores in text data with an implementation in R
Ke-Li Chiu, Rohan Alexander

TL;DR
This paper presents a reproducible method for cleaning PDF-extracted text using n-gram models, introducing a consistency score to monitor text quality, with tools in R for practical application.
Contribution
It introduces a novel consistency score for text cleaning validation and provides an R implementation with a Shiny app for reproducible text data processing.
Findings
The cleaning process is illustrated on text from the book Jane Eyre.
A Shiny application and an R package are introduced to make the process easier for others to adopt.
The consistency score is used to monitor changes in text quality during cleaning and across different corpora.
Abstract
In this paper, we introduce a reproducible cleaning process for the text extracted from PDFs using n-gram models. Our approach compares the originally extracted text with the text generated from, or expected by, these models using earlier text as stimulus. To guide this process, we introduce the notion of a consistency score, which refers to the proportion of text that is expected by the model. This is used to monitor changes during the cleaning process, and across different corpuses. We illustrate our process on text from the book Jane Eyre and introduce both a Shiny application and an R package to make our process easier for others to adopt.
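The idea of a consistency score can be illustrated with a small sketch. This is not the authors' R package or their exact method: it is a minimal Python illustration that assumes a word-level trigram model, assumes "expected" means the observed word equals the model's single most frequent continuation, and skips contexts the model has never seen. The function names `tokenize`, `train_ngram_model`, and `consistency_score` are hypothetical and chosen for this example only.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase and split on whitespace. A real pipeline
    # would also need to handle punctuation and hyphenation.
    return text.lower().split()


def train_ngram_model(tokens, n=3):
    # Map each (n-1)-word context to a count of the words that follow it.
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model


def consistency_score(tokens, model, n=3):
    # Proportion of checked words that match the model's top prediction,
    # using the preceding n-1 words as the stimulus.
    expected = 0
    checked = 0
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        actual = tokens[i + n - 1]
        if context not in model:
            continue  # Assumption: unseen contexts are skipped, not penalized.
        checked += 1
        predicted, _ = model[context].most_common(1)[0]
        if predicted == actual:
            expected += 1
    return expected / checked if checked else 0.0


# Toy reference text standing in for a trusted corpus.
reference = "the cat sat on the mat and the cat sat on the mat"
model = train_ngram_model(tokenize(reference))

clean = tokenize("the cat sat on the mat")
noisy = tokenize("the cat sat on tbe mat")  # "tbe" mimics an OCR error

print(consistency_score(clean, model))            # 1.0
print(round(consistency_score(noisy, model), 2))  # 0.67
```

In this toy setup, the simulated extraction error lowers the score, and correcting it raises the score again, which is the sense in which such a score could be used to monitor a cleaning process.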
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Natural Language Processing Techniques
