Zero-shot data citation function classification using transformer-based large language models (LLMs)
Neil Byers, Ali Zaidi, Valerie Skye, Chris Beecroft, Kjiersten Fagnan

TL;DR
This paper explores using transformer-based large language models to automatically classify how datasets are used in scientific publications, aiming to scale data citation analysis without manual labeling.
Contribution
It demonstrates the application of Llama 3.1-405B for zero-shot classification of data use cases and introduces a new evaluation framework for this task.
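As a rough illustration of what zero-shot classification with structured output involves, the sketch below builds a prompt that lists the allowed labels and then validates a JSON reply. It is not the authors' prompt or label set: the label names, the `build_prompt` and `parse_labels` helpers, and the JSON reply shape are all assumptions made for this example, and the call to the model itself is left out.

```python
import json

# Hypothetical label set, for illustration only; the paper's actual
# data use case categories are not given in this summary.
LABELS = ["primary analysis", "comparison", "background reference"]


def build_prompt(passage, dataset_id, labels=LABELS):
    """Assemble a zero-shot prompt: task description and allowed labels, no examples."""
    label_list = "\n".join(f"- {label}" for label in labels)
    return (
        "You are labeling how a dataset is used in a scientific publication.\n"
        f"Dataset: {dataset_id}\n"
        f"Passage: {passage}\n"
        "Choose every label that applies from this list:\n"
        f"{label_list}\n"
        'Reply with JSON only, in the form {"labels": [...]}.'
    )


def parse_labels(raw_response, labels=LABELS):
    """Parse the model's reply and keep only labels from the allowed list."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return []  # malformed output counts as no prediction
    if not isinstance(data, dict):
        return []
    candidates = data.get("labels", [])
    if not isinstance(candidates, list):
        return []
    allowed = set(labels)
    return [c for c in candidates if isinstance(c, str) and c in allowed]
```

Filtering the reply against the allowed list keeps invented labels out of the results, which matters when no fine-tuning constrains the model's output.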
Findings
Achieved an F1 score of 0.674 in zero-shot classification.
Showed potential of LLMs to identify data use cases without training data.
Identified barriers like data availability and computational costs.
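For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes a micro-averaged F1 over sets of labels per publication. This averaging choice, the `micro_f1` helper, and the toy labels are assumptions for illustration; the summary does not say how the paper's evaluation framework aggregates scores.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over per-publication label sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g, p = set(g), set(p)
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but not in the gold set
        fn += len(g - p)   # gold labels the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data with hypothetical labels: 2 true positives, 1 false positive,
# 2 false negatives, so precision = 2/3 and recall = 1/2.
gold = [{"reanalysis"}, {"comparison", "reference"}, {"benchmarking"}]
pred = [{"reanalysis"}, {"comparison"}, {"reference"}]
score = micro_f1(gold, pred)  # 4/7, about 0.571
```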
Abstract
Efforts have increased in recent years to identify associations between specific datasets and the scientific literature that incorporates them. Knowing that a given publication cites a given dataset, the next logical step is to explore how or why that data was used. Advances in recent years with pretrained, transformer-based large language models (LLMs) offer potential means for scaling the description of data use cases in the published literature. This avoids expensive manual labeling and the development of training datasets for classical machine-learning (ML) systems. In this work we apply an open-source LLM, Llama 3.1-405B, to generate structured data use case labels for publications known to incorporate specific genomic datasets. We also introduce a novel evaluation framework for determining the efficacy of our methods. Our results demonstrate that the stock model can achieve an F1…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling
