tabulapdf: An R Package to Extract Tables from PDF Documents

Mauricio Vargas Sepúlveda; Thomas J. Leeper; Tom Paskhalis; Manuel Aristarán; Jeremy B. Merrill; Mike Tigas

arXiv:2409.14524 · cs.IR · September 24, 2024

Open Access

TL;DR

tabulapdf is an R package that simplifies extracting tables from PDF documents by integrating Tabula Java, offering both automatic and manual extraction methods via a user-friendly interface, thus saving time in data analysis workflows.

Contribution

The paper introduces tabulapdf, a new R package that combines automated and manual table extraction from PDFs using Tabula Java, enhancing data retrieval efficiency.

Findings

01. Enables direct import of PDF tables into R.

02. Supports manual table selection via Shiny interface.

03. Reduces time and effort in data extraction processes.

Abstract

tabulapdf is an R package that uses the Tabula Java library to import tables from PDF files directly into R. It can reduce the time and effort spent on data extraction in fields such as investigative journalism. It supports both automatic and manual table extraction; the manual mode uses a Shiny interface in which the user selects table areas with the mouse.
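
The two extraction modes described in the abstract might be used roughly as follows. This is a minimal sketch, not code from the paper: the file name `report.pdf` and the wrapper `extract_pdf_tables()` are hypothetical, and the sketch assumes the package's `extract_tables()` and `extract_areas()` functions are called with a file path and a `pages` argument. The guards let the script run (returning an empty list) when the package, its Java dependency, or the file is missing.

```r
# Hypothetical input path; replace with a real PDF that contains tables.
pdf_path <- "report.pdf"

# Hypothetical helper around the automatic mode. It returns an empty list
# when tabulapdf is not installed or the file does not exist.
extract_pdf_tables <- function(path, pages = 1) {
  if (!requireNamespace("tabulapdf", quietly = TRUE) || !file.exists(path)) {
    return(list())
  }
  # Automatic mode: Tabula guesses the table regions on the given pages.
  tabulapdf::extract_tables(path, pages = pages)
}

tables <- extract_pdf_tables(pdf_path)
length(tables)  # number of tables found (0 if nothing could be read)

# Manual mode: opens an interactive viewer in which table areas are drawn
# with the mouse, so it is only attempted in an interactive session.
if (interactive() &&
    requireNamespace("tabulapdf", quietly = TRUE) &&
    file.exists(pdf_path)) {
  selected <- tabulapdf::extract_areas(pdf_path, pages = 1)
}
```

The automatic call is usually the first thing to try; the manual mode is the fallback when the guessed regions miss or split a table.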

Taxonomy

Topics: Mathematics, Computing, and Information Processing

Methods: Lib