Extracting Body Text from Academic PDF Documents for Text Mining

Changfeng Yu; Cheng Zhang; Jie Wang

arXiv:2010.12647·cs.IR·October 27, 2020

Open Access

TL;DR

This paper introduces PDFBoT, a system that accurately extracts complete body text, sentences, and paragraphs from academic PDFs, improving text mining by effectively distinguishing main content from nonbody elements.

Contribution

The paper presents PDFBoT, a novel method combining layout detection, feature-based filtering, and syntactic tagging to enhance text extraction accuracy from complex PDF layouts.

Findings

01  Achieves an average F1 score of 0.99 on sentence extraction

02  Attains an average F1 score of 0.96 on paragraph extraction

03  Reaches an average F1 score of 0.98 on removing nonbody text in tables, figures, and charts
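
For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below shows the standard definition only; how the paper counts a correct sentence or paragraph is not stated in this summary, and the function name and the counts in the usage line are made up for illustration, not taken from the paper.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    if true_pos == 0:
        # No correct items: precision and recall are both zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not results from the paper).
example = f1_score(true_pos=8, false_pos=2, false_neg=2)
```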

Abstract

Accurate extraction of body text from PDF-formatted academic documents is essential in text-mining applications for deeper semantic understanding. The objective is to extract complete sentences in the body text into a txt file while preserving the original sentence flow and paragraph boundaries. Existing tools for extracting text from PDF documents often mix body and nonbody text. We devise and implement a system called PDFBoT to detect multiple-column layouts using a line-sweeping technique, remove nonbody text using computed text features and syntactic tagging in backward traversal, and align the remaining text back to sentences and paragraphs. We show that PDFBoT is highly accurate, with average F1 scores of 0.99 on extracting sentences, 0.96 on extracting paragraphs, and 0.98 on removing text in tables, figures, and charts, over a corpus of PDF documents randomly selected…
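
The abstract mentions detecting multiple-column layouts with a line-sweeping technique but gives no details here. As a rough illustration of the general idea, and not PDFBoT's actual algorithm, the sketch below sweeps a vertical line from left to right across the horizontal extents of the text lines on a page and reports interior x-intervals that no text line crosses. The function name, the (x_left, x_right) box format, and the min_gap threshold are all assumptions introduced for this example.

```python
from typing import List, Tuple

Box = Tuple[float, float]  # assumed format: (x_left, x_right) of one text line


def find_column_gaps(boxes: List[Box], min_gap: float = 10.0) -> List[Tuple[float, float]]:
    """Return interior x-intervals of width >= min_gap that no box covers."""
    events = []
    for x_left, x_right in boxes:
        events.append((x_left, 1))    # sweep line enters a text line
        events.append((x_right, -1))  # sweep line leaves a text line
    # At equal x, process "enter" before "leave" so touching boxes leave no gap.
    events.sort(key=lambda e: (e[0], -e[1]))

    gaps = []
    active = 0        # number of boxes the sweep line currently crosses
    gap_start = None  # x where coverage last dropped to zero
    for x, delta in events:
        if delta == 1 and active == 0 and gap_start is not None:
            if x - gap_start >= min_gap:
                gaps.append((gap_start, x))
            gap_start = None
        active += delta
        if active == 0:
            gap_start = x
    return gaps


# Hypothetical page: two text lines per column, columns separated near x = 300.
page = [(50, 290), (50, 280), (310, 550), (310, 540)]
column_count = len(find_column_gaps(page)) + 1
```

Under these assumptions, each reported gap is a candidate column boundary, and the left and right page margins are ignored because only gaps between boxes are recorded. A full-width element such as a title would hide the gap, so a practical version would likely need to sweep within horizontal bands of the page rather than over the whole page at once.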

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Biomedical Text Mining and Ontologies