STEREO: A Pipeline for Extracting Experiment Statistics, Conditions, and Topics from Scientific Papers
Steffen Epp, Marcel Hoffmann, Nicolas Lell, Michael Mohr, Ansgar Scherp

TL;DR
STEREO is a flexible pipeline that efficiently extracts experimental statistics, conditions, and topics from scientific papers, even when they deviate from standard APA style, enabling better analysis of research reports.
Contribution
STEREO introduces a novel pipeline that combines active wrapper induction with unsupervised aspect extraction, covering diverse writing styles with minimal training data.
Findings
Achieved 95% coverage of sentences with only 0.25% of the corpus used for training.
Nearly 100% precision on APA-conform and 95% on non-APA writing styles.
Extracted 113,000 statistics with significant coverage of experimental conditions.
Abstract
A common writing style for statistical results is the one recommended by the American Psychological Association, known as APA style. In practice, however, writing styles vary: reports do not fully follow APA style, or they omit parameters that are mandatory. In addition, statistics are not reported in isolation but in the context of the experimental conditions investigated and the general topic. We address these challenges by proposing a flexible pipeline, STEREO, based on active wrapper induction and unsupervised aspect extraction. We applied our pipeline to the over 100,000 documents in the CORD-19 dataset. It required only 0.25% of the corpus (about 500 documents) to learn statistics extraction rules that cover 95% of the sentences in CORD-19. The statistic extraction has nearly 100% precision on APA-conform and 95% precision on non-APA writing styles. In total, we were able…
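To make the idea of a statistics extraction rule concrete, here is a minimal sketch of what one such rule might look like for a single APA-style pattern, a Student's t-test reported as "t(28) = 2.31, p < .05". It is a hand-written regular expression, not a rule learned by STEREO's active wrapper induction, and the names `APA_T_TEST` and `extract_t_tests` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical, hand-written rule (an assumption, not taken from the paper):
# matches APA-style t-test reports such as "t(28) = 2.31, p < .05".
APA_T_TEST = re.compile(
    r"\bt\s*\(\s*(?P<df>\d+(?:\.\d+)?)\s*\)"   # degrees of freedom in parentheses
    r"\s*=\s*(?P<value>-?\d*\.?\d+)"           # test statistic, optionally negative
    r"\s*,\s*p\s*(?P<rel>[<>=])"               # relation used for the p-value
    r"\s*(?P<p>\d*\.\d+)"                      # p-value, with or without a leading zero
)


def extract_t_tests(sentence):
    """Return one dict per t-test report found in the sentence."""
    return [
        {
            "df": float(m["df"]),
            "t": float(m["value"]),
            "p_rel": m["rel"],
            "p": float(m["p"]),
        }
        for m in APA_T_TEST.finditer(sentence)
    ]


if __name__ == "__main__":
    print(extract_t_tests("The groups differed, t(28) = 2.31, p < .05."))
```

A rule this strict only covers reports that follow the APA pattern closely. Per the abstract, handling non-APA writing styles and unreported parameters is what the pipeline's learned rules are meant to address.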
Taxonomy
Topics: Advanced Text Analysis Techniques · Topic Modeling · Data Analysis with R
