ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations

Sampanna Yashwant Kahu; William A. Ingram; Edward A. Fox; Jian Wu

arXiv:2106.15320 · cs.CV · June 30, 2021 · 1 citation

TL;DR

This paper introduces ScanBank, a large manually labeled dataset of scanned ETD pages, and demonstrates that training a YOLOv5 model on it significantly improves figure and table extraction from scanned documents compared to existing methods.

Contribution

The paper presents the first manually annotated dataset for figure extraction from scanned ETDs and develops a YOLOv5-based model that outperforms existing methods.

Findings

01

YOLOv5 trained on ScanBank extracts figures and tables from scanned ETDs more accurately than existing methods.

02

Data augmentation improves model performance on scanned documents.

03

ScanBank enables better training for figure extraction from scanned PDFs.
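
As a rough illustration of the data augmentation finding, the sketch below degrades a clean page so it looks more like a scan while keeping the figure's bounding-box label aligned with the pixels. It is a minimal sketch under stated assumptions, not the authors' pipeline: this page does not list the transformations used in the paper, so salt-and-pepper noise and a small translation are illustrative choices, and the names add_scan_noise and shift_page are hypothetical. Pages are plain nested lists of grayscale values so the example runs without any image library.

```python
import random


def add_scan_noise(page, noise_fraction=0.02, seed=0):
    """Return a copy of a grayscale page with salt-and-pepper noise.

    page: list of rows, each a list of ints in 0..255 (hypothetical format).
    noise_fraction: probability that any one pixel is replaced (assumed value).
    """
    rng = random.Random(seed)
    noisy = [row[:] for row in page]  # copy so the input is not mutated
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < noise_fraction:
                row[x] = rng.choice((0, 255))  # black or white speck
    return noisy


def shift_page(page, box, dx, dy, fill=255):
    """Translate a page by (dx, dy) pixels and move the figure box with it.

    box: (x_min, y_min, x_max, y_max) in pixels, max edges exclusive.
    Pixels shifted off the page are dropped; uncovered pixels get `fill`.
    The returned box is clipped to the page bounds.
    """
    height, width = len(page), len(page[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                shifted[ny][nx] = page[y][x]
    x0, y0, x1, y1 = box
    new_box = (
        min(max(x0 + dx, 0), width),
        min(max(y0 + dy, 0), height),
        min(max(x1 + dx, 0), width),
        min(max(y1 + dy, 0), height),
    )
    return shifted, new_box


if __name__ == "__main__":
    # A white 20x20 page with a dark 5x5 "figure" labeled by its box.
    page = [[255] * 20 for _ in range(20)]
    for y in range(5, 10):
        for x in range(5, 10):
            page[y][x] = 0
    moved, moved_box = shift_page(page, (5, 5, 10, 10), dx=3, dy=2)
    augmented = add_scan_noise(moved, noise_fraction=0.02, seed=1)
    print(moved_box)
```

The point of the design is that geometric changes must be applied to the label as well as the image, whereas pixel-level noise leaves the box untouched. A detector such as YOLOv5 would then be trained on the augmented page and box pairs.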

Abstract

We focus on electronic theses and dissertations (ETDs), aiming to improve access and expand their utility, since more than 6 million are publicly available, and they constitute an important corpus to aid research and education across disciplines. The corpus is growing as new born-digital documents are included, and since millions of older theses and dissertations have been converted to digital form to be disseminated electronically in institutional repositories. In ETDs, as with other scholarly works, figures and tables can communicate a large amount of information in a concise way. Although methods have been proposed for extracting figures and tables from born-digital PDFs, they do not work well with scanned ETDs. Considering this problem, our assessment of state-of-the-art figure extraction systems is that the reason they do not function well on scanned PDFs is that they have only…

Code & Models

Repositories

SampannaKahu/ScanBank
PyTorch · Official

Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction · Mathematics, Computing, and Information Processing