MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs
Zaid Alyafeai, Maged S. Al-Shaibani, Bernard Ghanem

TL;DR
MOLE is a framework that uses Large Language Models to automatically extract and validate metadata from scientific papers, improving efficiency and consistency in dataset cataloging across multiple languages.
Contribution
This paper introduces MOLE, a novel schema-driven LLM-based system for automated metadata extraction and validation from scientific articles in various languages, with a new benchmark for evaluation.
Findings
LLMs show promising results in metadata extraction
Context length and few-shot learning impact performance
Web browsing integration enhances extraction accuracy
Abstract
Metadata extraction is essential for cataloging and preserving datasets, enabling effective research discovery and reproducibility, especially given the current exponential growth in scientific research. While Masader (Alyafeai et al., 2021) laid the groundwork for extracting a wide range of metadata attributes from Arabic NLP datasets' scholarly articles, it relies heavily on manual annotation. In this paper, we present MOLE, a framework that leverages Large Language Models (LLMs) to automatically extract metadata attributes from scientific papers covering datasets of languages other than Arabic. Our schema-driven methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output. Additionally, we introduce a new benchmark to evaluate the research progress on this task. Through systematic analysis of context length,…
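To make the idea of schema-driven extraction with validation more concrete, the following is a minimal sketch of how a model's JSON output might be checked against a schema before it is accepted. It is not the MOLE implementation: the schema attributes (`Name`, `Year`, `License`, `Languages`), the allowed license options, and the function `validate_metadata` are all hypothetical names chosen for illustration, and the LLM call itself is left out.

```python
import json

# Hypothetical schema: attribute name -> expected type and optional constraints.
# These attributes are illustrative assumptions, not taken from the MOLE paper.
SCHEMA = {
    "Name": {"type": str},
    "Year": {"type": int, "min": 1900, "max": 2100},
    "License": {"type": str, "options": ["MIT", "Apache-2.0", "CC BY 4.0", "unknown"]},
    "Languages": {"type": list},
}


def validate_metadata(raw_output, schema=SCHEMA):
    """Parse a model's JSON output and check it against the schema.

    Returns (metadata, errors), where metadata holds only the attributes
    that passed every check and errors lists the problems found.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return {}, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return {}, ["top-level JSON value must be an object"]

    metadata, errors = {}, []
    for name, rule in schema.items():
        if name not in data:
            errors.append(f"{name}: missing")
            continue
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "options" in rule and value not in rule["options"]:
            errors.append(f"{name}: {value!r} is not an allowed option")
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{name}: {value} is below {rule['min']}")
            continue
        if "max" in rule and value > rule["max"]:
            errors.append(f"{name}: {value} is above {rule['max']}")
            continue
        metadata[name] = value

    # Attributes the schema does not define are reported rather than kept.
    for name in data:
        if name not in schema:
            errors.append(f"{name}: not defined in schema")
    return metadata, errors
```

In a pipeline of this kind, the error list could be fed back to the model in a retry prompt or used to flag a record for manual review; which of these a real system does is a design choice outside the scope of this sketch.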
Taxonomy
Topics: Semantic Web and Ontologies
