The Digitization of Historical Astrophysical Literature with Highly-Localized Figures and Figure Captions
Jill P. Naiman, Peter K. G. Williams, Alyssa Goodman

TL;DR
This paper introduces a YOLO-based method for extracting figures and captions from scanned astrophysical articles, significantly improving localization accuracy over previous approaches.
Contribution
The authors develop a robust, OCR-enhanced YOLO-based technique for high-precision figure and caption extraction from digitized scientific literature.
Findings
Achieved F1 scores of 90.9% for figures and 92.2% for figure captions at an IOU cut-off of 0.9
Significant improvement over existing document layout analysis methods at this cut-off
Applied to the astrophysics literature holdings of the NASA Astrophysics Data System (ADS)
Abstract
Scientific articles published prior to the "age of digitization" in the late 1990s contain figures that are "trapped" within their scanned pages. While progress has been made in extracting figures and their captions, there is currently no robust method for this process. We present a YOLO-based method for use on scanned pages, after they have been processed with Optical Character Recognition (OCR), which uses both grayscale and OCR features. We focus our efforts on translating the intersection-over-union (IOU) metric from the field of object detection to document layout analysis and quantify "high localization" levels as an IOU of 0.9. When applied to the astrophysics literature holdings of the NASA Astrophysics Data System (ADS), we find F1 scores of 90.9% (92.2%) for figures (figure captions) with the IOU cut-off of 0.9, which is a significant improvement over other state-of-the-art…
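To make the IOU cut-off concrete, the following is a minimal sketch of how an IOU between two bounding boxes and an F1 score at a 0.9 cut-off could be computed. It is not the authors' evaluation code: the `(x_min, y_min, x_max, y_max)` box layout, the function names `iou` and `f1_at_iou`, and the greedy one-to-one matching of predictions to ground-truth boxes are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in page pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(predicted: List[Box], truth: List[Box], threshold: float = 0.9) -> float:
    """F1 score where a prediction counts as a true positive only if it
    overlaps a not-yet-matched ground-truth box with IOU >= threshold.
    Greedy matching is an illustrative assumption, not the paper's procedure."""
    unmatched = list(truth)
    true_pos = 0
    for box in predicted:
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= threshold:
            true_pos += 1
            unmatched.remove(best)
    false_pos = len(predicted) - true_pos
    false_neg = len(truth) - true_pos
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0


# Hypothetical example: one tightly localized figure, one half-covered figure.
truth_boxes = [(0, 0, 100, 100), (200, 200, 300, 300)]
predicted_boxes = [(0, 0, 100, 98), (200, 200, 300, 250)]
score = f1_at_iou(predicted_boxes, truth_boxes)
```

In this made-up example the first prediction has an IOU of 0.98 and counts as a match, while the second covers only half of its figure (IOU 0.5) and is rejected. That rejection is what a "high localization" cut-off of 0.9 demands compared with looser cut-offs such as 0.5.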
Taxonomy
Topics: Handwritten Text Recognition Techniques · Mathematics, Computing, and Information Processing · Natural Language Processing Techniques
