Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers

Aivin V. Solatorio, Rafael Macalaba, and James Liounis

arXiv:2502.10263·cs.CL·February 17, 2025

TL;DR

This paper introduces a machine learning framework that uses large language models and synthetic data to automate the detection of dataset mentions in research papers, improving scalability and accuracy.

Contribution

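It presents a two-stage fine-tuning process for scalable dataset mention detection in academic literature: a language model is first pre-fine-tuned on an LLM-generated, weakly supervised synthetic dataset, then fine-tuned on a manually annotated subset.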
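The synthetic-data side of this process, which the abstract describes as zero-shot extraction, an LLM-as-a-Judge check, and refinement by a reasoning agent, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `build_weak_dataset`, `Candidate`, and the three `toy_*` functions are hypothetical names, and the regex heuristics merely stand in for the LLM calls.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Candidate:
    passage: str   # text the mention was extracted from
    mention: str   # candidate dataset name


def build_weak_dataset(
    passages: Iterable[str],
    extract: Callable[[str], List[str]],
    judge: Callable[[Candidate], bool],
    refine: Callable[[Candidate], Optional[Candidate]],
) -> List[Candidate]:
    """Keep candidates the judge accepts, giving rejected ones one refinement pass."""
    dataset: List[Candidate] = []
    for passage in passages:
        for mention in extract(passage):
            cand = Candidate(passage, mention)
            if judge(cand):
                dataset.append(cand)
                continue
            fixed = refine(cand)
            if fixed is not None and judge(fixed):
                dataset.append(fixed)
    return dataset


# Hypothetical stand-ins; in the paper each role is played by an LLM.
def toy_extract(passage: str) -> List[str]:
    # Assumption: treat whatever follows "using (the)" as a candidate mention.
    return re.findall(r"using (?:the )?([^.,]+)", passage)


def toy_judge(cand: Candidate) -> bool:
    # Assumption: accept only verbatim spans ending in "Survey" or "Census".
    return cand.mention in cand.passage and cand.mention.endswith(("Survey", "Census"))


def toy_refine(cand: Candidate) -> Optional[Candidate]:
    # Assumption: trim trailing words after the dataset-type keyword.
    match = re.search(r"^(.*?(?:Survey|Census))\b", cand.mention)
    return Candidate(cand.passage, match.group(1)) if match else None
```

Passing the three roles in as callables keeps the loop independent of any particular model; the resulting list is what a pre-fine-tuning stage would consume before the manually annotated subset is used.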

Findings

01  Outperforms existing models like NuExtract-v1.5 and GLiNER-large-v2.1 in accuracy.

02  Synthetic data generated by LLMs effectively addresses training data scarcity.

03  The framework reduces computational overhead while maintaining high recall.

Abstract

Tracking how data is mentioned and used in research papers provides critical insights for improving data discoverability, quality, and production. However, manually identifying and classifying dataset mentions across vast academic literature is resource-intensive and not scalable. This paper presents a machine learning framework that automates dataset mention detection across research domains by leveraging large language models (LLMs), synthetic data, and a two-stage fine-tuning process. We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset. The Phi-3.5-mini instruct model is pre-fine-tuned on this dataset, followed by fine-tuning on a manually annotated subset. At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing…
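The inference-time idea in the abstract, a lightweight classifier that filters passages before the more expensive extraction model runs, can be sketched as below. This is a sketch under stated assumptions rather than the paper's code: `extract_with_gate`, `toy_score`, `toy_extract`, and the `threshold` parameter are hypothetical, and the keyword scorer only stands in for the ModernBERT-based classifier.

```python
import re
from typing import Callable, Iterable, List, Tuple


def extract_with_gate(
    passages: Iterable[str],
    score: Callable[[str], float],
    extract: Callable[[str], List[str]],
    threshold: float = 0.5,
) -> Tuple[List[Tuple[str, List[str]]], int]:
    """Run the costly extractor only on passages the cheap scorer lets through.

    Returns the (passage, mentions) pairs plus the number of passages skipped.
    """
    results: List[Tuple[str, List[str]]] = []
    skipped = 0
    for passage in passages:
        if score(passage) < threshold:
            skipped += 1          # extractor is never called for this passage
            continue
        results.append((passage, extract(passage)))
    return results, skipped


# Hypothetical stand-in for the classifier: 1.0 if a cue word appears, else 0.0.
def toy_score(passage: str) -> float:
    cues = ("survey", "census", "dataset")
    return 1.0 if any(cue in passage.lower() for cue in cues) else 0.0


# Hypothetical stand-in for the fine-tuned extraction model.
def toy_extract(passage: str) -> List[str]:
    return re.findall(r"the ([A-Z][\w ]*?(?:Survey|Census))", passage)
```

In a setup like this, lowering `threshold` lets more passages through, trading extra extractor calls for fewer missed mentions.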

Code & Models

Repositories

worldbank/ai4data-use (Official)

Taxonomy

Topics: Topic Modeling