Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers
Aivin V. Solatorio, Rafael Macalaba, and James Liounis

TL;DR
This paper introduces a machine learning framework that uses large language models and synthetic data to automate the detection of dataset mentions in research papers, improving scalability and accuracy.
Contribution
It presents a novel two-stage fine-tuning process leveraging synthetic data and LLMs for scalable dataset mention detection in academic literature.
Findings
The framework outperforms existing models such as NuExtract-v1.5 and GLiNER-large-v2.1 in accuracy.
Synthetic data generated by LLMs effectively addresses training data scarcity.
The framework reduces computational overhead while maintaining high recall.
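The overhead finding above comes from placing a lightweight classifier in front of the extraction model at inference. The sketch below illustrates that filter-then-extract pattern only; it is not the authors' implementation. `filter_then_extract`, `toy_score`, `toy_extract`, the passage-level granularity, and the 0.5 threshold are all hypothetical, and the toy functions are keyword and regex stand-ins for the ModernBERT-based classifier and the fine-tuned extractor.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Mention:
    passage_index: int
    text: str


def filter_then_extract(
    passages: List[str],
    score: Callable[[str], float],
    extract: Callable[[str], List[str]],
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> Tuple[List[Mention], int]:
    """Run the costly extractor only on passages the cheap scorer keeps."""
    mentions: List[Mention] = []
    extractor_calls = 0
    for index, passage in enumerate(passages):
        if score(passage) < threshold:
            continue  # skipped passages never reach the extractor
        extractor_calls += 1
        for name in extract(passage):
            mentions.append(Mention(index, name))
    return mentions, extractor_calls


# Toy stand-ins (assumptions): a keyword scorer in place of the classifier,
# and a regex in place of the fine-tuned extraction model.
CUES = ("survey", "census", "dataset")


def toy_score(passage: str) -> float:
    text = passage.lower()
    return 1.0 if any(cue in text for cue in CUES) else 0.0


def toy_extract(passage: str) -> List[str]:
    # Capitalized words followed by a cue word, e.g. "Household Budget Survey".
    return re.findall(r"(?:[A-Z][\w-]*\s)+(?:Survey|Census|Dataset)", passage)


passages = [
    "We use the Living Standards Measurement Survey to estimate poverty.",
    "Standard errors are clustered at the village level.",
    "The model converges after ten epochs.",
]
mentions, calls = filter_then_extract(passages, toy_score, toy_extract)
print(calls, [m.text for m in mentions])
```

In this toy run the extractor is called on one of the three passages. Lowering `threshold` sends more passages to the extractor, which favors recall at the cost of more extractor calls.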
Abstract
Tracking how data is mentioned and used in research papers provides critical insights for improving data discoverability, quality, and production. However, manually identifying and classifying dataset mentions across vast academic literature is resource-intensive and not scalable. This paper presents a machine learning framework that automates dataset mention detection across research domains by leveraging large language models (LLMs), synthetic data, and a two-stage fine-tuning process. We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset. The Phi-3.5-mini instruct model is pre-fine-tuned on this dataset, followed by fine-tuning on a manually annotated subset. At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing…
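The data-generation steps named in the abstract (zero-shot extraction, an LLM-as-a-Judge, and a reasoning agent for refinement) can be pictured as a small loop. The sketch below is a guess at the control flow, not the paper's pipeline: the abstract does not say how the three components interact, so sending only rejected candidates to refinement and re-judging them once is an assumption. `build_weak_labels` and the `toy_*` functions are hypothetical, and the toy functions are rule-based stand-ins for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    passage: str
    mention: str


def build_weak_labels(
    passages: List[str],
    extract: Callable[[str], List[str]],
    judge: Callable[[Candidate], bool],
    refine: Callable[[Candidate], Optional[Candidate]],
) -> List[Candidate]:
    """Keep candidates the judge accepts, giving rejected ones one repair attempt."""
    kept: List[Candidate] = []
    for passage in passages:
        for mention in extract(passage):
            candidate = Candidate(passage, mention)
            if judge(candidate):
                kept.append(candidate)
                continue
            repaired = refine(candidate)  # assumed: refine only after rejection
            if repaired is not None and judge(repaired):
                kept.append(repaired)
    return kept


# Toy stand-ins (assumptions) for the three LLM-backed components.
def toy_extract(passage: str) -> List[str]:
    # Treat whatever follows " using " as a candidate mention.
    _, sep, tail = passage.partition(" using ")
    return [tail.rstrip(".")] if sep else []


def toy_judge(candidate: Candidate) -> bool:
    # Accept mentions found verbatim in the passage that start with a capital.
    return candidate.mention in candidate.passage and candidate.mention[:1].isupper()


def toy_refine(candidate: Candidate) -> Optional[Candidate]:
    # Strip a leading article; give up if there is nothing to strip.
    for article in ("the ", "a ", "an "):
        if candidate.mention.startswith(article):
            return Candidate(candidate.passage, candidate.mention[len(article):])
    return None


passages = [
    "We estimate poverty using the Household Budget Survey.",
    "We fit the model using gradient descent.",
    "Results are robust.",
]
labels = build_weak_labels(passages, toy_extract, toy_judge, toy_refine)
print([c.mention for c in labels])
```

Here the first candidate is rejected, repaired by dropping its article, and then accepted; the second cannot be repaired and is dropped; the third passage yields no candidate.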
Taxonomy
Topics: Topic Modeling
