Leveraging Large Language Models for Automated Scalable Development of Open Scientific Databases

Nikita Gautam; Doina Caragea; Ignacio Ciampitti; Federico Gomez

arXiv:2603.07050·cs.IR·March 10, 2026

Open Access

TL;DR

This paper presents a web-based tool that uses Large Language Models to automate and scale the development of open scientific databases by combining keyword queries, API data retrieval, and LLM-based filtering, reducing manual effort.

Contribution

The paper introduces a novel, scalable framework leveraging LLMs for automated scientific database construction, applicable across diverse domains with high accuracy.

Findings

01. Achieved 90% overlap with expert-curated databases

02. Reduced manual workload significantly

03. Demonstrated domain-agnostic scalability

Abstract

With the exponential increase in online scientific literature, identifying reliable domain-specific data has become increasingly important but also very challenging. Manual data collection and filtering for domain-specific scientific literature is not only time-consuming but also labor-intensive and prone to errors and inconsistencies. To facilitate automated data collection, the paper introduces a web-based tool that leverages Large Language Models (LLMs) for automated and scalable development of open scientific databases. More specifically, the tool is based on an automated and unified framework that combines keyword-based querying, API-enabled data retrieval, and LLM-powered text classification to construct domain-specific scientific databases. Data is collected from multiple reliable data sources and search engines using a parallel querying technique to construct a combined unified…
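The pipeline the abstract describes (keyword query, parallel retrieval from several sources, merge, then LLM-based filtering) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: `source_a`, `source_b`, `deduplicate`, and `is_relevant` are hypothetical names, the sources return hard-coded records instead of calling real APIs, merging by DOI is an assumption, and a simple keyword check stands in for the LLM classifier.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Record = Dict[str, str]
Source = Callable[[str], List[Record]]


# Hypothetical stand-ins for API-backed sources (assumption: real code
# would call each service's search API with the keyword query).
def source_a(query: str) -> List[Record]:
    return [
        {"doi": "10.1/a", "title": "Maize yield response to nitrogen",
         "abstract": "Field trial of maize yield under nitrogen rates."},
        {"doi": "10.1/b", "title": "Deep learning for image captioning",
         "abstract": "A neural captioning model."},
    ]


def source_b(query: str) -> List[Record]:
    return [
        {"doi": "10.1/A", "title": "Maize Yield Response to Nitrogen",
         "abstract": "Field trial of maize yield under nitrogen rates."},
        {"doi": "10.1/c", "title": "Nitrogen uptake in maize hybrids",
         "abstract": "Maize nitrogen uptake measured across hybrids."},
    ]


def fetch_all(query: str, sources: List[Source]) -> List[Record]:
    """Query every source in parallel and flatten the results in source order."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        batches = list(pool.map(lambda src: src(query), sources))
    return [record for batch in batches for record in batch]


def deduplicate(records: List[Record]) -> List[Record]:
    """Keep the first record per case-insensitive DOI (assumed merge key)."""
    seen = set()
    unique = []
    for record in records:
        key = record["doi"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def is_relevant(record: Record, domain_terms: List[str]) -> bool:
    """Placeholder for the LLM classification step: a plain keyword check."""
    text = (record["title"] + " " + record["abstract"]).lower()
    return all(term in text for term in domain_terms)


def build_database(query: str, sources: List[Source],
                   domain_terms: List[str]) -> List[Record]:
    merged = deduplicate(fetch_all(query, sources))
    return [record for record in merged if is_relevant(record, domain_terms)]


database = build_database("maize nitrogen", [source_a, source_b],
                          ["maize", "nitrogen"])
```

In a real deployment, `is_relevant` would be replaced by a call to whichever LLM is used for classification, and each source function would wrap an actual literature API; both are left abstract here because the listing does not specify them.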

Taxonomy

Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Machine Learning in Materials Science