Zero-shot data citation function classification using transformer-based large language models (LLMs)

Neil Byers; Ali Zaidi; Valerie Skye; Chris Beecroft; Kjiersten Fagnan

arXiv:2511.02936·cs.LG·November 6, 2025

Open Access

TL;DR

This paper explores using transformer-based large language models to automatically classify how datasets are used in scientific publications, aiming to scale data citation analysis without manual labeling.

Contribution

The paper applies Llama 3.1-405B, an open-source LLM, to zero-shot classification of data use cases in publications known to incorporate specific genomic datasets, and introduces a new evaluation framework for this task.
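A zero-shot setup of this kind can be sketched as a prompt builder plus a parser for the structured reply. This is a minimal illustration, not the authors' pipeline: the label names in `LABELS`, the helper names `build_prompt` and `parse_labels`, and the JSON reply format are all hypothetical, and no model call is made.

```python
import json

# Hypothetical label set for illustration; the paper's actual
# data use case categories are not listed in this summary.
LABELS = ["primary_analysis", "comparison", "background_mention"]


def build_prompt(passage: str, dataset_id: str) -> str:
    """Assemble a zero-shot prompt: task description and allowed labels, no worked examples."""
    return (
        "You label how a dataset is used in a scientific publication.\n"
        f"Allowed labels: {', '.join(LABELS)}\n"
        f"Dataset: {dataset_id}\n"
        f"Passage: {passage}\n"
        'Answer with JSON only, in the form {"labels": ["..."]}'
    )


def parse_labels(raw: str) -> list[str]:
    """Parse a model reply, keeping only labels from the allowed set."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed reply: treat as no labels
    if not isinstance(data, dict):
        return []
    labels = data.get("labels", [])
    if not isinstance(labels, list):
        return []
    return [label for label in labels if label in LABELS]
```

The string returned by `build_prompt` would be sent to whichever inference endpoint hosts the model, and the reply passed to `parse_labels`. Filtering against the allowed set guards against the model inventing labels outside the schema.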

Findings

01. Achieved an F1 score of 0.674 in zero-shot classification.

02. Showed potential of LLMs to identify data use cases without training data.

03. Identified barriers like data availability and computational costs.
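For readers unfamiliar with the metric in the first finding, F1 is the harmonic mean of precision P and recall R, F1 = 2PR / (P + R). The sketch below computes a micro-averaged F1 over per-publication label sets. It assumes a multi-label setting and micro-averaging; the paper's own evaluation framework may aggregate differently, and the function name `micro_f1` is illustrative.

```python
def micro_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Micro-averaged F1 over per-publication label sets (assumed aggregation)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # Pool true positives, false positives, and false negatives across publications.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0  # no correct labels: precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With gold sets `[{"a", "b"}, {"c"}]` and predictions `[{"a"}, {"c", "d"}]`, there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and so is F1.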

Abstract

Efforts have increased in recent years to identify associations between specific datasets and the scientific literature that incorporates them. Knowing that a given publication cites a given dataset, the next logical step is to explore how or why that data was used. Advances in recent years with pretrained, transformer-based large language models (LLMs) offer potential means for scaling the description of data use cases in the published literature. This avoids expensive manual labeling and the development of training datasets for classical machine-learning (ML) systems. In this work we apply an open-source LLM, Llama 3.1-405B, to generate structured data use case labels for publications known to incorporate specific genomic datasets. We also introduce a novel evaluation framework for determining the efficacy of our methods. Our results demonstrate that the stock model can achieve an F1…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling