Comparison of Feature Learning Methods for Metadata Extraction from PDF Scholarly Documents

Zeyd Boukhers; Cong Yang

arXiv:2501.05082·cs.IR·January 10, 2025

Open Access

TL;DR

This paper compares different feature learning methods, including NLP, CV, and multimodal approaches, for extracting metadata from diverse PDF documents to improve accessibility and adherence to FAIR principles.

Contribution

It provides a comprehensive evaluation of various feature learning techniques for metadata extraction from PDFs with high template variability, offering insights into their effectiveness.

Findings

01 NLP methods perform well on textual metadata extraction.

02 Multimodal approaches enhance accuracy in complex documents.

03 The study highlights strengths and weaknesses of each method.

Abstract

The availability of metadata for scientific documents is pivotal for advancing scientific knowledge and for adhering to the FAIR principles (i.e., Findability, Accessibility, Interoperability, and Reusability) of research findings. However, the lack of sufficient metadata in published documents, particularly those from smaller and mid-sized publishers, hinders their accessibility. This issue is widespread in some disciplines, such as the German Social Sciences, where publications often employ diverse templates. To address this challenge, our study evaluates various feature learning and prediction methods, including natural language processing (NLP), computer vision (CV), and multimodal approaches, for extracting metadata from documents with high template variance. We aim to improve the accessibility of scientific documents and facilitate their wider use. To support our comparison…

Taxonomy

Topics: Web Data Mining and Analysis · Data Quality and Management · Semantic Web and Ontologies