Extraction of tabulated statistical results with tableParser
Ingmar Böschen

TL;DR
This paper introduces the R package *tableParser* for extracting and analyzing statistical results from tables in scientific documents across multiple formats, enabling large-scale meta-analyses and consistency checks.
Contribution
The work presents a new R package that efficiently extracts, decodes, and analyzes statistical test results from various document formats, improving data extraction accuracy and scalability.
Findings
Effective extraction from HTML and DOCX with high accuracy.
Limited decoding capabilities for PDFs due to extraction issues.
Supports large-scale analysis of statistical reporting practices.
Abstract
Tabulated content is omnipresent in scientific literature. This work presents the R package *tableParser*, designed to extract and postprocess tables from NISO-JATS-encoded XML, HTML, DOCX, and, with limitations, PDF documents. *tableParser* focuses on extracting and analyzing statistical test results reported in scientific publications. It can be used for large-scale analysis of effect sizes, reporting practices, or summarization of results, as well as for checking completeness and consistency of standard test results in unpublished documents. Documents can be processed in three decoding levels. *table2matrix()* compiles all tables into a list of character matrices with captions and footnotes. *table2text()* collapses the matrix contents into human-readable text, mimicking a screen reader. Optionally, many common codings that are reported within the table's caption and footnote can be…
Taxonomy
Topics: SAS software applications and methods · Data Analysis with R · Mathematics, Computing, and Information Processing
