pdfPapers: shell-script utilities for frequency-based multi-word phrase extraction from PDF documents
Pavel Loskot

TL;DR
This paper introduces shell script utilities for *nix systems that facilitate text extraction and multi-word phrase frequency analysis from PDF documents, demonstrated on scientific papers in life sciences.
Contribution
Development of robust shell script utilities for extracting and analyzing multi-word phrases from PDFs, enhancing text mining capabilities in life sciences research.
Findings
Procedure is robust despite extraction deficiencies
Stop-word removal should be limited to phrase boundaries
Utilities can convert PDFs into biochemical term lists
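One reading of the second finding is that stop words should be stripped only from the start and end of a candidate phrase, so that phrases such as "rate of change" keep their internal connecting words. The sketch below illustrates that idea under stated assumptions: the function name `trim_boundary_stopwords`, the tiny `STOPWORDS` list, and the one-phrase-per-line lowercase input format are all hypothetical and are not taken from the paper's utilities.

```shell
#!/bin/sh
# Hypothetical stop-word list for illustration; a practical list would be much longer.
STOPWORDS="a an and in of on the to"

# Read one lowercase phrase per line on stdin and strip stop words
# from the phrase boundaries only; stop words inside a phrase are kept.
# Phrases consisting solely of stop words are dropped.
trim_boundary_stopwords() {
    awk -v stop="$STOPWORDS" '
        BEGIN {
            n = split(stop, s, " ")
            for (i = 1; i <= n; i++) is_stop[s[i]] = 1
        }
        {
            first = 1
            last = NF
            # Advance past leading stop words, then back up over trailing ones.
            while (first <= last && (($first) in is_stop)) first++
            while (last >= first && (($last) in is_stop)) last--
            if (first > last) next
            out = $first
            for (i = first + 1; i <= last; i++) out = out " " $i
            print out
        }
    '
}
```

For example, `printf 'the rate of change\n' | trim_boundary_stopwords` drops the leading "the" but keeps the internal "of".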
Abstract
Biomedical research relies heavily on processing information from previously published papers. Over the past decade, this has motivated considerable effort to provide tools for text mining and information extraction from PDF documents. The *nix (Unix/Linux) operating systems offer many tools for working with text files; however, very few such tools are available for processing the contents of PDF files. This paper reports our effort to develop shell script utilities for *nix systems whose core functionality is viewing and searching multiple PDF documents by combining logical and regular expressions, and enabling more reliable text extraction from PDF documents with subsequent manipulation of the resulting blocks of text. Furthermore, a procedure for extracting the most frequently occurring multi-word phrases was devised and then demonstrated on several scientific papers in life…
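The frequency-based phrase extraction described in the abstract can be illustrated with standard *nix text tools. The following is a minimal sketch, not the paper's actual implementation: the function name `top_phrases` and its arguments are hypothetical, and it simplifies by splitting on every non-alphanumeric character (so hyphenated terms are broken apart) and by letting phrases span sentence boundaries.

```shell
#!/bin/sh
# Hypothetical helper: print the most frequent n-word phrases in text read from stdin.
# Usage: top_phrases [N] [COUNT]   (defaults: 2-word phrases, top 10)
# Output: one "count<TAB>phrase" line per phrase, most frequent first.
top_phrases() {
    n="${1:-2}"
    count="${2:-10}"
    tr '[:upper:]' '[:lower:]' |      # fold case so "Gene" and "gene" match
    tr -cs '[:alnum:]' '\n' |         # one word per line; punctuation is discarded
    awk -v n="$n" '
        NF {
            w[++k] = $0
            if (k >= n) {
                # Join the last n words into one phrase and count it.
                p = w[k - n + 1]
                for (i = k - n + 2; i <= k; i++) p = p " " w[i]
                c[p]++
            }
        }
        END { for (p in c) print c[p] "\t" p }
    ' |
    sort -t "$(printf '\t')" -k1,1nr -k2,2 |   # by count (descending), then phrase
    head -n "$count"
}
```

Assuming the `pdftotext` tool from Poppler is installed (the paper's utilities may rely on a different extractor), the function could be applied to a PDF as `pdftotext paper.pdf - | top_phrases 3 20`, where `paper.pdf` is a placeholder file name.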
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Advanced Text Analysis Techniques · Semantic Web and Ontologies
