tabulapdf: An R Package to Extract Tables from PDF Documents
Mauricio Vargas Sepúlveda, Thomas J. Leeper, Tom Paskhalis, Manuel Aristarán, Jeremy B. Merrill, and Mike Tigas

TL;DR
tabulapdf is an R package that simplifies extracting tables from PDF documents by integrating Tabula Java, offering both automatic and manual extraction methods via a user-friendly interface, thus saving time in data analysis workflows.
Contribution
The paper introduces tabulapdf, a new R package that combines automated and manual table extraction from PDFs using Tabula Java, enhancing data retrieval efficiency.
Findings
Enables direct import of PDF tables into R.
Supports manual table selection via Shiny interface.
Reduces time and effort in data extraction processes.
Abstract
tabulapdf is an R package that uses the Tabula Java library to import tables from PDF files directly into R. This tool can reduce the time and effort spent on data extraction in fields such as investigative journalism. It supports both automatic and manual table extraction; the latter is provided through a Shiny interface in which the user selects table areas with the mouse.
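The two extraction modes described in the abstract might be used roughly as follows. This is a minimal sketch, not the authors' documented workflow: the file name `report.pdf` and the wrapper functions `extract_auto` and `extract_manual` are hypothetical, and the calls assume that the package's `extract_tables()` and `extract_areas()` functions accept a file path and a `pages` argument.

```r
# Hypothetical input file; replace with the path to your own PDF.
pdf_path <- "report.pdf"

# Automatic extraction: let Tabula detect the tables on the given pages.
# (Wrapper name is illustrative; it only forwards to tabulapdf.)
extract_auto <- function(path, pages = 1) {
  stopifnot(file.exists(path))
  tabulapdf::extract_tables(path, pages = pages)
}

# Manual extraction: opens an interactive widget in which the table
# area is drawn with the mouse, so it only makes sense in an
# interactive R session.
extract_manual <- function(path, pages = 1) {
  stopifnot(file.exists(path), interactive())
  tabulapdf::extract_areas(path, pages = pages)
}

# Run the automatic mode only if the file and the package are available.
if (file.exists(pdf_path) && requireNamespace("tabulapdf", quietly = TRUE)) {
  tables <- extract_auto(pdf_path)
  str(tables)  # inspect the structure of whatever was extracted
}
```

Automatic detection is the natural first attempt; the manual mode is a fallback for pages where detection picks up the wrong region or merges adjacent tables.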
Taxonomy
Topics: Mathematics, Computing, and Information Processing
Methods: Lib
