Comparison of Feature Learning Methods for Metadata Extraction from PDF Scholarly Documents
Zeyd Boukhers, Cong Yang

TL;DR
This paper compares different feature learning methods, including NLP, CV, and multimodal approaches, for extracting metadata from diverse PDF documents to improve accessibility and adherence to FAIR principles.
Contribution
It provides a comprehensive evaluation of various feature learning techniques for metadata extraction from PDFs with high template variability, offering insights into their effectiveness.
Findings
NLP methods perform well on textual metadata extraction.
Multimodal approaches enhance accuracy in complex documents.
The study highlights strengths and weaknesses of each method.
Abstract
The availability of metadata for scientific documents is pivotal for advancing scientific knowledge and for adhering to the FAIR principles (i.e., Findability, Accessibility, Interoperability, and Reusability) of research findings. However, the lack of sufficient metadata in published documents, particularly those from small and mid-sized publishers, hinders their accessibility. This issue is widespread in some disciplines, such as the German Social Sciences, where publications often employ diverse templates. To address this challenge, our study evaluates various feature learning and prediction methods, including natural language processing (NLP), computer vision (CV), and multimodal approaches, for extracting metadata from documents with high template variance. We aim to improve the accessibility of scientific documents and facilitate their wider use. To support our comparison…
Taxonomy
Topics: Web Data Mining and Analysis · Data Quality and Management · Semantic Web and Ontologies
