MexPub: Deep Transfer Learning for Metadata Extraction from German Publications
Zeyd Boukhers, Nada Beili, Timo Hartmann, Prantik Goswami, and Muhammad Arslan Zafar

TL;DR
This paper introduces a deep learning approach using Mask R-CNN to accurately extract metadata from German scientific PDFs with diverse layouts, outperforming traditional NLP methods.
Contribution
It presents a novel image-based method with a synthetic dataset for extracting metadata from German publications, addressing layout variability.
Findings
Achieved around 90% accuracy in metadata extraction
Effectively handles diverse layouts and styles in German PDFs
Utilized synthetic data for model fine-tuning
Abstract
Extracting metadata from scientific papers can be considered a solved problem in NLP due to the high accuracy of state-of-the-art methods. However, this does not apply to German scientific publications, which have a variety of styles and layouts. In contrast to most English scientific publications, which follow standard and simple layouts, the order, content, position and size of metadata in German publications vary greatly. This variety causes traditional NLP methods to fail to extract metadata accurately from these publications. In this paper, we present a method that extracts metadata from PDF documents with different layouts and styles by viewing the document as an image. We used Mask R-CNN trained on the COCO dataset and fine-tuned on the PubLayNet dataset, which consists of ~200K PDF snapshots with five basic classes (e.g. text, figure). We refine-tuned the…
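The abstract describes treating a page as an image and detecting metadata regions on it. As a rough illustration of how such detections might be turned into metadata strings, the sketch below keeps the highest-scoring region per class and collects the PDF words whose centers fall inside it. This post-processing step is an assumption made for illustration, not something the abstract specifies; the `Detection` and `Word` types, the class labels, and the 0.5 score threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical detector output for one region of the page image."""
    label: str    # illustrative class name, e.g. "title"
    score: float  # detector confidence in [0, 1]
    box: tuple    # (x0, y0, x1, y1) in page coordinates


@dataclass
class Word:
    """Hypothetical word taken from the PDF text layer."""
    text: str
    box: tuple    # (x0, y0, x1, y1), same coordinate system as Detection.box


def best_per_class(detections, min_score=0.5):
    """Keep only the highest-scoring detection for each label.

    The threshold is an assumed value, not taken from the paper.
    """
    best = {}
    for det in detections:
        if det.score < min_score:
            continue
        if det.label not in best or det.score > best[det.label].score:
            best[det.label] = det
    return best


def words_in_box(words, box):
    """Join the words whose center point lies inside the box."""
    x0, y0, x1, y1 = box
    inside = []
    for word in words:
        cx = (word.box[0] + word.box[2]) / 2
        cy = (word.box[1] + word.box[3]) / 2
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            inside.append(word)
    # Naive reading order: top edge first, then left edge.
    inside.sort(key=lambda w: (w.box[1], w.box[0]))
    return " ".join(w.text for w in inside)


def extract_metadata(detections, words, min_score=0.5):
    """Map each detected class label to the text found in its best region."""
    best = best_per_class(detections, min_score)
    return {label: words_in_box(words, det.box) for label, det in best.items()}
```

Taking one region per class is a simplification: fields such as authors can span several regions, so a real pipeline would likely merge multiple boxes per label.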
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Image Processing and 3D Reconstruction
Methods: Region Proposal Network · Convolution · Softmax · RoIAlign · Mask R-CNN
