pdfPapers: shell-script utilities for frequency-based multi-word phrase extraction from PDF documents

Pavel Loskot

arXiv:2101.10554 · q-bio.QM · January 27, 2021

Open Access

TL;DR

This paper introduces shell script utilities for *nix systems that facilitate text extraction and multi-word phrase frequency analysis from PDF documents, demonstrated on scientific papers in life sciences.

Contribution

Development of robust shell script utilities for extracting and analyzing multi-word phrases from PDFs, enhancing text mining capabilities in life sciences research.

Findings

01

Procedure is robust despite extraction deficiencies

02

Stop-word removal should be limited to phrase boundaries

03

Utilities can convert PDFs into biochemical term lists
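The second finding can be illustrated with a minimal sketch. It is written in Python for brevity and is not the author's shell implementation: the function name `count_phrases`, the tiny stop-word list, and the sample text are all hypothetical. It counts n-word phrases and discards only those that begin or end with a stop word, so a stop word inside a phrase (such as "of" in "regulation of gene expression") is kept. As a simplification, punctuation is dropped, so phrases may span sentence boundaries.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list used only for illustration.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def count_phrases(text, n=4, stop_words=STOP_WORDS):
    """Count n-word phrases, skipping those that start or end with a stop word."""
    # Lower-case the text and keep alphanumeric words (hyphenated words stay whole).
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    counts = Counter()
    for i in range(len(words) - n + 1):
        gram = words[i:i + n]
        # Stop words are tested only at the phrase boundaries, not inside it.
        if gram[0] in stop_words or gram[-1] in stop_words:
            continue
        counts[" ".join(gram)] += 1
    return counts


sample = (
    "Regulation of gene expression is studied widely. "
    "The regulation of gene expression depends on context."
)
top = count_phrases(sample, n=4).most_common(1)
print(top)
```

Removing stop words everywhere before counting would instead collapse "regulation of gene expression" into "regulation gene expression", which is one plausible reason to restrict the removal to phrase boundaries.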

Abstract

Biomedical research relies heavily on processing information from previously published papers. Over the past decade, this has motivated considerable effort to provide tools for text mining and information extraction from PDF documents. The *nix (Unix/Linux) operating systems offer many tools for working with text files; however, very few such tools are available for processing the contents of PDF files. This paper reports our effort to develop shell-script utilities for *nix systems whose core functionality focuses on viewing and searching multiple PDF documents by combining logical and regular expressions, and on enabling more reliable text extraction from PDF documents with subsequent manipulation of the resulting blocks of text. Furthermore, a procedure for extracting the most frequently occurring multi-word phrases was devised and then demonstrated on several scientific papers in life…
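As a rough illustration of searching several documents by combining logical and regular expressions, the following Python sketch assumes the text has already been extracted from each PDF (for example with a tool such as `pdftotext`). It is not the author's shell code; the names `matches` and `search_documents`, the AND/OR/NOT query form, and the sample documents are assumptions made for this example.

```python
import re


def matches(text, all_of=(), any_of=(), none_of=()):
    """Return True if text satisfies the AND / OR / NOT regex groups."""
    def found(pattern):
        return re.search(pattern, text, re.IGNORECASE) is not None

    return (
        all(found(p) for p in all_of)                        # every pattern must occur
        and (not any_of or any(found(p) for p in any_of))    # at least one, if given
        and not any(found(p) for p in none_of)               # none may occur
    )


def search_documents(docs, **query):
    """Return the sorted names of documents whose text matches the query."""
    return sorted(name for name, text in docs.items() if matches(text, **query))


# Hypothetical extracted text, keyed by file name.
docs = {
    "paper1.txt": "Protein kinases regulate cell signalling.",
    "paper2.txt": "Gene expression in yeast was measured.",
    "paper3.txt": "Kinase inhibitors alter gene expression.",
}

hits = search_documents(docs, all_of=[r"kinases?\b"], none_of=[r"\byeast\b"])
print(hits)
```

Keeping the logical combination separate from the individual patterns means each regular expression stays short, while the AND/OR/NOT structure is expressed in ordinary code.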

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Advanced Text Analysis Techniques · Semantic Web and Ontologies