Identifying the Development and Application of Artificial Intelligence in Scientific Text

James Dunham; Jennifer Melot; Dewey Murdick

arXiv:2002.07143·cs.DL·May 29, 2020·22 cites

TL;DR

This paper presents a machine learning approach to identify AI-related research publications across large scientific corpora by leveraging arXiv metadata, enabling dynamic and automated classification without expert labeling.

Contribution

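The authors train supervised classifiers on the subject tags that arXiv authors choose for their own papers, then apply those classifiers to identify AI-relevant publications in three larger corpora (Clarivate Web of Science, Digital Science Dimensions, and Microsoft Academic Graph) without manual labeling by experts.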

Findings

01

Predictive F1 scores between 0.75 and 0.86 for Natural Language Processing (cs.CL), Computer Vision (cs.CV), and Robotics (cs.RO).

02

Precision of 0.83 and recall of 0.85 for a single model covering seven AI-relevant arXiv subjects (cs.AI, cs.LG, stat.ML, cs.MA, cs.CL, cs.CV, and cs.RO).

03

Supervised classifiers can effectively identify AI research across diverse datasets.
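
The findings above report F1 for the individual subjects but only precision and recall for the combined model. As a minimal sketch of how the two kinds of figures relate, the snippet below computes F1 as the harmonic mean of precision and recall. The function name `f1_score` is illustrative, and the resulting value is derived here from the rounded figures above; it is not a number quoted from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Convention: F1 is taken as 0 when both inputs are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded figures reported for the combined AI relevance model.
combined_f1 = f1_score(0.83, 0.85)
print(round(combined_f1, 2))  # prints 0.84
```

Under these rounded inputs, the combined model's implied F1 falls inside the 0.75 to 0.86 range reported for the single-subject classifiers.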

Abstract

We describe a strategy for identifying the universe of research publications relevant to the application and development of artificial intelligence. The approach leverages the arXiv corpus of scientific preprints, in which authors choose subject tags for their papers from a set defined by editors. We compose a functional definition of AI relevance by learning these subjects from paper metadata, and then inferring the arXiv-subject labels of papers in larger corpora: Clarivate Web of Science, Digital Science Dimensions, and Microsoft Academic Graph. This yields predictive classification $F_{1}$ scores between .75 and .86 for Natural Language Processing (cs.CL), Computer Vision (cs.CV), and Robotics (cs.RO). For a single model that learns these and four other AI-relevant subjects (cs.AI, cs.LG, stat.ML, and cs.MA), we see precision of .83 and recall of .85. We evaluate the out-of-domain…
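
The abstract's "functional definition of AI relevance" can be illustrated with a short sketch of how training labels might be derived from arXiv subject tags. The seven subject codes come from the abstract. Everything else is an assumption made for illustration: the names `AI_SUBJECTS` and `is_ai_relevant`, the record layout, and the rule that a paper counts as positive when any one of its tags is in the set. This is not the authors' code.

```python
# Subject codes named in the abstract as AI-relevant.
AI_SUBJECTS = frozenset(
    {"cs.AI", "cs.LG", "stat.ML", "cs.MA", "cs.CL", "cs.CV", "cs.RO"}
)


def is_ai_relevant(subject_tags):
    """Assumed rule: positive if any author-chosen tag is an AI subject."""
    return any(tag in AI_SUBJECTS for tag in subject_tags)


# Hypothetical metadata records, for illustration only.
papers = [
    {"id": "example-1", "subjects": ["cs.CL", "cs.LG"]},
    {"id": "example-2", "subjects": ["math.AG"]},
]

# Map each paper id to its derived training label.
labels = {paper["id"]: is_ai_relevant(paper["subjects"]) for paper in papers}
print(labels)
```

Labels produced this way need no expert annotation, since they reuse the tags that authors already chose. A classifier trained on them can then be applied to papers in corpora that carry no arXiv tags.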

Code & Models

Repositories

georgetown-cset/ai-relevant-papers (PyTorch, official)

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Scientific Computing and Data Management