Removing Manually-Generated Boilerplate from Electronic Texts: Experiments with Project Gutenberg e-Books

Owen Kaser; Daniel Lemire

arXiv:0707.1913·cs.DL·August 24, 2016

TL;DR

This paper explores a statistical method to automatically remove boilerplate text from Project Gutenberg e-books, reducing manual effort and improving processing of large literary corpora.

Contribution

It shows that statistical techniques based on frequent occurrences can identify boilerplate sections across diverse texts, notes that some documents still require knowledge of English, and surveys technical solutions for scaling the approach to large data sets.

Findings

01. A statistical approach removes most boilerplate text.

02. Some documents require knowledge of English for accurate removal.

03. A survey of technical solutions makes the approach applicable to large data sets.

Abstract

Collaborative work on unstructured or semi-structured documents, such as literature corpora or source code, often involves agreed-upon templates containing metadata. These templates are not consistent across users or over time. Rule-based parsing of these templates is expensive to maintain and tends to fail as new documents are added. Statistical techniques based on frequent occurrences have the potential to automatically identify a large fraction of the templates, thus reducing the burden on programmers. We investigate the case of the Project Gutenberg corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed. We show that a statistical approach can solve most cases, though some documents require knowledge of English. We also survey various technical solutions that make our approach applicable to large data…
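
The idea of detecting templates by "frequent occurrences" can be illustrated with a minimal sketch. This is not the authors' algorithm, whose details are not given in the abstract; it only assumes that a line appearing in several documents is likely boilerplate, and that boilerplate sits in the preamble and epilogue. The function names, the `min_docs` threshold, and the two sample strings are hypothetical and chosen for illustration only.

```python
from collections import Counter


def frequent_lines(documents, min_docs=2):
    """Return the normalized lines that occur in at least min_docs documents."""
    doc_freq = Counter()
    for text in documents:
        # Count each distinct non-blank line once per document (document frequency).
        seen = {line.strip() for line in text.splitlines() if line.strip()}
        doc_freq.update(seen)
    return {line for line, n in doc_freq.items() if n >= min_docs}


def strip_boilerplate(text, frequent):
    """Drop leading and trailing runs of frequent or blank lines; keep the middle as-is."""
    lines = text.splitlines()

    def is_boilerplate(line):
        stripped = line.strip()
        return not stripped or stripped in frequent

    start = 0
    while start < len(lines) and is_boilerplate(lines[start]):
        start += 1
    end = len(lines)
    while end > start and is_boilerplate(lines[end - 1]):
        end -= 1
    return "\n".join(lines[start:end])


# Made-up stand-ins for e-book files; real preambles and epilogues vary far more.
docs = [
    "The Project Gutenberg EBook\n*** START ***\nCall me Ishmael.\n*** END ***",
    "The Project Gutenberg EBook\n*** START ***\nIt was the best of times.\n*** END ***",
]
frequent = frequent_lines(docs)
bodies = [strip_boilerplate(d, frequent) for d in docs]
```

Only the leading and trailing runs are trimmed, so a frequent line that also occurs inside the body of a book is left alone. A sketch like this would miss preamble lines that are unique to one document, such as a title or release date, which is one reason a purely statistical approach cannot handle every case.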

Taxonomy

Topics: Natural Language Processing Techniques · Web Data Mining and Analysis · Algorithms and Data Compression