Reducing a Set of Regular Expressions and Analyzing Differences of Domain-specific Statistic Reporting

Tobias Kalmbach; Marcel Hoffmann; Nicolas Lell; Ansgar Scherp

arXiv:2211.13632·cs.DL·March 28, 2023

TL;DR

This paper improves a tool for extracting statistical data from scientific papers by reducing rule complexity, analyzing domain differences, and comparing extraction methods between PDF and LaTeX sources.

Contribution

It adapts a regular expression inclusion algorithm to optimize the extraction tool and evaluates its performance across different scientific domains and file formats.
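To make the idea of reducing a rule set concrete, here is a minimal sketch in Python. It is not the inclusion algorithm the paper adapts: it swaps in a much weaker, sample-based check, where a rule counts as redundant only if another rule fully matches every sample string that it matches. The function names `covered_on_samples` and `prune_rules`, the example rules, and the sample strings are all hypothetical.

```python
import re

def covered_on_samples(rule, other, samples):
    """True if every sample fully matched by `rule` is also fully matched by `other`.

    Rules that match no sample are never reported as covered, so they are kept.
    """
    matched = [s for s in samples if re.fullmatch(rule, s)]
    return bool(matched) and all(re.fullmatch(other, s) for s in matched)

def prune_rules(rules, samples):
    """Drop rules that another rule covers on the given samples.

    If two rules cover each other (equivalent on the samples), the earlier one is kept.
    """
    kept = []
    for i, rule in enumerate(rules):
        redundant = any(
            covered_on_samples(rule, other, samples)
            and (not covered_on_samples(other, rule, samples) or j < i)
            for j, other in enumerate(rules)
            if j != i
        )
        if not redundant:
            kept.append(rule)
    return kept

# Hypothetical rules and samples, for illustration only.
rules = [
    r"p < \.05",                # narrow rule
    r"p\s*[<>=]\s*\.\d+",       # broader rule that subsumes the narrow one here
    r"t\(\d+\) = \d+\.\d+",     # unrelated rule
]
samples = ["p < .05", "p = .01", "t(28) = 2.31"]
pruned = prune_rules(rules, samples)
print(pruned)
```

A sample-based check can only show redundancy relative to the strings it is given, whereas a true inclusion test reasons about the full languages of the two expressions. The sketch is meant only to show what "one rule subsumes another" looks like in practice.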

Findings

01. Reduced the number of regular expressions used in STEREO by about 33.8%

02. Found similar statistical patterns in the HCI and medical domains

03. LaTeX sources yield more reliable extraction than PDFs

Abstract

Because of the large number of scientific publications released every day, it is impossible to review each one manually. Automatic extraction of key information is therefore desirable. In this paper, we examine STEREO, a tool for extracting statistics from scientific papers using regular expressions. By adapting an existing regular expression inclusion algorithm to our use case, we decrease the number of regular expressions used in STEREO by about 33.8%. We reveal common patterns in the condensed rule set that can be used to create new rules. We also apply STEREO, which was previously trained on the life-sciences and medical domain, to a new scientific domain, namely Human-Computer Interaction (HCI), and re-evaluate it. According to our research, statistics in the HCI domain are similar to those in the medical domain, although a higher percentage of APA-conforming statistics were found…
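As an illustration of the kind of rule the abstract refers to, the following Python sketch extracts one APA-style report, a t-test written as `t(df) = value, p < value`. The pattern, the name `APA_T_TEST`, and the helper `extract_t_tests` are assumptions made for this example; they are not taken from STEREO's rule set.

```python
import re

# Hypothetical pattern for an APA-style t-test report such as "t(28) = 2.31, p < .05".
# Illustrative only; not one of STEREO's actual rules.
APA_T_TEST = re.compile(
    r"\bt\s*\(\s*(?P<df>\d+(?:\.\d+)?)\s*\)\s*=\s*(?P<t>-?\d*\.?\d+)\s*,\s*"
    r"p\s*(?P<op>[<>=])\s*(?P<p>0?\.\d+)"
)

def extract_t_tests(text):
    """Return one dict (df, t, op, p) per APA-style t-test report found in `text`."""
    return [m.groupdict() for m in APA_T_TEST.finditer(text)]

sample = "The groups differed, t(28) = 2.31, p < .05, in completion time."
results = extract_t_tests(sample)
print(results)
```

Named groups keep each component of the statistic addressable, which makes it easier to see which parts of a report a given rule captures. Reports that deviate from the APA layout would need additional or looser patterns.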

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Natural Language Processing Techniques · Data Mining Algorithms and Applications