Leveraging Large Language Models for Automated Scalable Development of Open Scientific Databases
Nikita Gautam, Doina Caragea, Ignacio Ciampitti, Federico Gomez

TL;DR
This paper presents a web-based tool that uses Large Language Models to automate and scale the development of open scientific databases by combining keyword queries, API data retrieval, and LLM-based filtering, reducing manual effort.
Contribution
The paper introduces a novel, scalable framework leveraging LLMs for automated scientific database construction, applicable across diverse domains with high accuracy.
Findings
Achieved 90% overlap with expert-curated databases
Reduced manual workload significantly
Demonstrated domain-agnostic scalability
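The summary does not say how "overlap" with an expert-curated database is measured. As a hedged illustration only, the sketch below assumes a recall-style definition: the fraction of expert-curated records that also appear in the automatically built set. The function name and the use of string identifiers (for example DOIs) are assumptions for this example, not details from the paper.

```python
from typing import Iterable


def overlap_with_reference(candidate_ids: Iterable[str],
                           reference_ids: Iterable[str]) -> float:
    """Fraction of reference (expert-curated) IDs also found in the candidate set.

    Assumed definition: |candidate ∩ reference| / |reference|.
    """
    reference = set(reference_ids)
    if not reference:
        raise ValueError("reference set is empty")
    return len(set(candidate_ids) & reference) / len(reference)


# Hypothetical identifiers: three of the four expert records are recovered.
score = overlap_with_reference(["a", "b", "c", "x"], ["a", "b", "c", "d"])
```

Under this definition, extra records retrieved by the tool (such as "x" above) do not lower the score; a precision-style measure would be needed to capture them.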
Abstract
With the exponential increase in online scientific literature, identifying reliable domain-specific data has become increasingly important but also very challenging. Manual data collection and filtering for domain-specific scientific literature is not only time-consuming but also labor-intensive and prone to errors and inconsistencies. To facilitate automated data collection, the paper introduces a web-based tool that leverages Large Language Models (LLMs) for automated and scalable development of open scientific databases. More specifically, the tool is based on an automated and unified framework that combines keyword-based querying, API-enabled data retrieval, and LLM-powered text classification to construct domain-specific scientific databases. Data is collected from multiple reliable data sources and search engines using a parallel querying technique to construct a combined unified…
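The pipeline described in the abstract (keyword query, parallel retrieval from several sources, merging, LLM-based filtering) might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the source functions are in-memory stubs standing in for real API clients, the record fields (`doi`, `title`) are hypothetical, and `looks_relevant` is a keyword check standing in for an LLM classifier call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Source = Callable[[str], List[Record]]  # keyword query -> list of records


def query_all(sources: Iterable[Source], keywords: str) -> List[Record]:
    """Send the same keyword query to every source in parallel."""
    sources = list(sources)
    if not sources:
        return []
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        batches = pool.map(lambda src: src(keywords), sources)
        # Flatten while the pool is still open; source order is preserved.
        return [rec for batch in batches for rec in batch]


def deduplicate(records: Iterable[Record]) -> List[Record]:
    """Keep the first record per DOI, falling back to a normalized title."""
    seen = set()
    unique = []
    for rec in records:
        key = rec.get("doi") or rec["title"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def build_database(sources: Iterable[Source], keywords: str,
                   is_relevant: Callable[[Record], bool]) -> List[Record]:
    """Retrieve, merge, and keep only records the classifier accepts."""
    merged = deduplicate(query_all(sources, keywords))
    return [rec for rec in merged if is_relevant(rec)]


# Hypothetical stub sources; real ones would call literature APIs.
def source_a(keywords: str) -> List[Record]:
    return [{"doi": "10.1/a", "title": "Maize yield under drought"},
            {"doi": "10.1/b", "title": "Deep learning for images"}]


def source_b(keywords: str) -> List[Record]:
    return [{"doi": "10.1/a", "title": "Maize yield under drought"},
            {"doi": "10.1/c", "title": "Maize nitrogen response"}]


def looks_relevant(rec: Record) -> bool:
    # Stand-in for an LLM relevance classifier (assumption for this sketch).
    return "maize" in rec["title"].lower()


db = build_database([source_a, source_b], "maize", looks_relevant)
```

Passing the classifier as a plain callable keeps the retrieval and merging steps independent of any particular model, so the stub could be replaced by an LLM call without touching the rest of the sketch.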
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Machine Learning in Materials Science
