Variable Extraction for Model Recovery in Scientific Literature
Chunwei Liu, Enrique Noriega-Atala, Adarsh Pyarelal, Clayton T. Morrison, Mike Cafarella

TL;DR
This paper evaluates methods for extracting variables from scientific literature, introduces a benchmark dataset, and demonstrates that large language models outperform rule-based systems in this task, aiding automatic model recovery.
Contribution
It introduces a benchmark dataset for variable extraction and shows that LLMs significantly outperform rule-based methods in extracting variables from scientific papers.
Findings
LLMs outperform rule-based extraction methods.
Combining rule-based and LLM methods yields marginal improvements.
LLMs show strong potential for automatic scientific artifact comprehension.
Abstract
The global output of academic publications exceeds 5 million articles per year, making it difficult for humans to keep up with even a tiny fraction of scientific output. We need methods to navigate and interpret the artifacts -- texts, graphs, charts, code, models, and datasets -- that make up the literature. This paper evaluates various methods for extracting mathematical model variables from epidemiological studies, such as "infection rate ()," "recovery rate ()," and "mortality rate ()." Variable extraction appears to be a basic task, but plays a pivotal role in recovering models from scientific literature. Once extracted, we can use these variables for automatic mathematical modeling, simulation, and replication of published results. We introduce a benchmark dataset comprising manually-annotated variable descriptions and variable values extracted from…
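To make the task concrete, the sketch below shows what a very small rule-based extractor might look like. It is an illustration only, not the rule-based system evaluated in the paper: the pattern, the function name `extract_variables`, and the sample sentence (including the symbols `beta` and `gamma`) are all assumptions made up for this example.

```python
import re

# Hypothetical pattern: "<word> rate (<symbol>) <=|is|of|was> <number>".
# Real papers phrase variables in far more varied ways; this is only a toy rule.
PATTERN = re.compile(
    r"(?P<description>\b\w+ rate)\s*"   # e.g. "infection rate"
    r"\((?P<symbol>[^)]+)\)\s*"         # e.g. "(beta)"
    r"(?:=|is|of|was)\s*"               # linking token before the value
    r"(?P<value>\d+(?:\.\d+)?)",        # e.g. "0.3"
    re.IGNORECASE,
)


def extract_variables(text):
    """Return a list of {description, symbol, value} dicts found in text."""
    return [
        {
            "description": m.group("description").lower(),
            "symbol": m.group("symbol").strip(),
            "value": float(m.group("value")),
        }
        for m in PATTERN.finditer(text)
    ]


# Invented sample sentence, not taken from the benchmark dataset.
sample = (
    "We set the infection rate (beta) = 0.3 and "
    "the recovery rate (gamma) is 0.1 per day."
)

for variable in extract_variables(sample):
    print(variable)
```

A rule like this misses any phrasing it was not written for, which is one plausible reason the findings above report LLMs outperforming rule-based extraction.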
Taxonomy
Topics: Scientific Computing and Data Management · Biomedical Text Mining and Ontologies · Image Processing and 3D Reconstruction
