TL;DR
This paper introduces a large-scale, automatically labeled dataset for scientific figure extraction, enabling the training of neural networks that significantly improve figure detection in scientific documents.
Contribution
The authors created the largest automatically labeled dataset for scientific figure extraction and trained a neural network that outperforms previous methods.
Findings
The induced labels have an average precision of 96.8%.
The dataset contains over 5.5 million induced labels, 4,000 times more than the previous largest figure extraction dataset.
Deployed in Semantic Scholar to extract figures from 13 million documents.
Abstract
Non-textual components such as charts, diagrams and tables provide key information in many scientific documents, but the lack of large labeled datasets has impeded the development of data-driven methods for scientific figure extraction. In this paper, we induce high-quality training labels for the task of figure extraction in a large number of scientific documents, with no human intervention. To accomplish this we leverage the auxiliary data provided in two large web collections of scientific documents (arXiv and PubMed) to locate figures and their associated captions in the rasterized PDF. We share the resulting dataset of over 5.5 million induced labels (4,000 times larger than the previous largest figure extraction dataset) with an average precision of 96.8%, to enable the development of modern data-driven methods for this task. We use this dataset to train a deep neural network…
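The abstract reports label quality as a precision figure over figure locations. As a rough illustration of what checking induced bounding boxes against a hand-annotated reference could look like, here is a minimal sketch. It is not the paper's evaluation protocol, which this summary does not describe. The function names, the `(x0, y0, x1, y1)` box format, and the 0.8 overlap threshold are all assumptions made for the example, and it computes plain precision at a single threshold, not average precision.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in page pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_precision(induced: List[Box], reference: List[Box],
                    threshold: float = 0.8) -> float:
    """Fraction of induced boxes that match a distinct reference box.

    The threshold is an assumed value, not one taken from the paper.
    Matching is greedy: each reference box can be used at most once.
    """
    unmatched = list(reference)
    hits = 0
    for box in induced:
        best = max(unmatched, key=lambda r: iou(box, r), default=None)
        if best is not None and iou(box, best) >= threshold:
            hits += 1
            unmatched.remove(best)
    return hits / len(induced) if induced else 0.0
```

For example, with two induced boxes of which only one overlaps a reference box, `label_precision` returns 0.5.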
