MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs

Zaid Alyafeai; Maged S. Al-Shaibani; Bernard Ghanem

arXiv:2505.19800 · cs.CL · September 19, 2025

TL;DR

MOLE is a framework that uses Large Language Models to automatically extract and validate metadata from scientific papers, improving efficiency and consistency in dataset cataloging across multiple languages.

Contribution

This paper introduces MOLE, a novel schema-driven LLM-based system for automated metadata extraction and validation from scientific articles in various languages, with a new benchmark for evaluation.
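The idea of schema-driven extraction followed by validation can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the attribute names, the allowed license values, the year bounds, and the `validate` helper are all hypothetical, chosen only to show how a schema can constrain and check the JSON a model returns.

```python
import json

# Hypothetical schema: attribute names, types, and constraints are
# illustrative assumptions, not the schema used by MOLE.
SCHEMA = {
    "Name": {"type": str, "min_len": 1},
    "Year": {"type": int, "min": 1900, "max": 2100},
    "License": {"type": str, "options": ["Apache-2.0", "MIT", "CC BY 4.0", "unknown"]},
    "Tasks": {"type": list, "min_len": 1},
}


def validate(raw_output, schema=SCHEMA):
    """Parse a model's raw JSON string and check it against the schema.

    Returns (metadata, errors); metadata is None if the string is not
    a JSON object, and errors is empty when every check passes.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be a JSON object"]

    errors = []
    for key, rule in schema.items():
        if key not in data:
            errors.append(f"{key}: missing")
            continue
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, rule["type"]):
            errors.append(f"{key}: expected {rule['type'].__name__}")
            continue
        if "options" in rule and value not in rule["options"]:
            errors.append(f"{key}: {value!r} is not an allowed option")
        if "min_len" in rule and len(value) < rule["min_len"]:
            errors.append(f"{key}: too short")
        if "min" in rule and value < rule["min"]:
            errors.append(f"{key}: below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{key}: above maximum {rule['max']}")

    # Flag attributes the schema does not define.
    for key in data:
        if key not in schema:
            errors.append(f"{key}: unexpected attribute")
    return data, errors


if __name__ == "__main__":
    sample = '{"Name": "Demo", "Year": 2024, "License": "MIT", "Tasks": ["ner"]}'
    metadata, problems = validate(sample)
    print(metadata, problems)
```

In a pipeline of this kind, a non-empty error list could be fed back to the model as a repair prompt or used to reject the extraction; which strategy MOLE uses is not stated in this listing.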

Findings

01 LLMs show promising results in metadata extraction

02 Context length and few-shot learning impact performance

03 Web browsing integration enhances extraction accuracy

Abstract

Metadata extraction is essential for cataloging and preserving datasets, enabling effective research discovery and reproducibility, especially given the current exponential growth in scientific research. While Masader (Alyafeai et al., 2021) laid the groundwork for extracting a wide range of metadata attributes from scholarly articles on Arabic NLP datasets, it relies heavily on manual annotation. In this paper, we present MOLE, a framework that leverages Large Language Models (LLMs) to automatically extract metadata attributes from scientific papers covering datasets of languages other than Arabic. Our schema-driven methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output. Additionally, we introduce a new benchmark to evaluate the research progress on this task. Through systematic analysis of context length,…

Code & Models

Repositories

ivul-kaust/mole (Official)

Datasets

IVUL-KAUST/MOLE
dataset · 28 downloads

Videos

MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs· underline

Taxonomy

Topics: Semantic Web and Ontologies