Identifying the Development and Application of Artificial Intelligence in Scientific Text
James Dunham, Jennifer Melot, Dewey Murdick

TL;DR
This paper presents a machine learning approach to identify AI-related research publications across large scientific corpora by leveraging arXiv metadata, enabling dynamic and automated classification without expert labeling.
Contribution
The authors train a supervised classifier on author-chosen arXiv subject tags and apply it to three larger corpora (Clarivate Web of Science, Digital Science Dimensions, and Microsoft Academic Graph), so that AI-relevant publications can be identified without manual labeling.
Findings
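F1 scores between 0.75 and 0.86 for Natural Language Processing (cs.CL), Computer Vision (cs.CV), and Robotics (cs.RO).
Precision of 0.83 and recall of 0.85 for a single model covering these and four other AI-relevant subjects (cs.AI, cs.LG, stat.ML, and cs.MA).
Classifiers trained on arXiv subject tags can be used to label papers in Clarivate Web of Science, Digital Science Dimensions, and Microsoft Academic Graph.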
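To relate the precision and recall figures to the F1 scale used for the subfield results, here is a minimal sketch of the standard F1 formula (the harmonic mean of precision and recall). The function name `f1_score` is illustrative, and the combined value it produces is derived arithmetic from the two reported numbers, not a figure stated in the source.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall reported for the combined AI-relevance model.
combined_f1 = f1_score(0.83, 0.85)
print(round(combined_f1, 2))  # prints 0.84
```

Under this formula the combined model's implied F1 falls inside the 0.75 to 0.86 range reported for the individual subfields.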
Abstract
We describe a strategy for identifying the universe of research publications relevant to the application and development of artificial intelligence. The approach leverages the arXiv corpus of scientific preprints, in which authors choose subject tags for their papers from a set defined by editors. We compose a functional definition of AI relevance by learning these subjects from paper metadata, and then inferring the arXiv-subject labels of papers in larger corpora: Clarivate Web of Science, Digital Science Dimensions, and Microsoft Academic Graph. This yields predictive classification F1 scores between .75 and .86 for Natural Language Processing (cs.CL), Computer Vision (cs.CV), and Robotics (cs.RO). For a single model that learns these and four other AI-relevant subjects (cs.AI, cs.LG, stat.ML, and cs.MA), we see precision of .83 and recall of .85. We evaluate the out-of-domain…
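The strategy in the abstract (learn subject tags from arXiv metadata, then infer those tags for papers in other corpora) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' model: the excerpt does not say which classifier they use, so a bag-of-words naive Bayes stands in, and the training records, `TRAIN`, `predict_subject`, and `is_ai_relevant` are all hypothetical. Only the seven AI-relevant subject codes come from the source.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy records standing in for arXiv metadata:
# (title text, author-chosen subject tag).
TRAIN = [
    ("neural machine translation with attention", "cs.CL"),
    ("parsing sentences with language models", "cs.CL"),
    ("image segmentation with convolutional networks", "cs.CV"),
    ("object detection in images", "cs.CV"),
    ("robot arm motion planning", "cs.RO"),
    ("legged robot locomotion control", "cs.RO"),
    ("galaxy cluster mass estimates", "astro-ph.CO"),
    ("dark matter halo simulations", "astro-ph.CO"),
]

# The seven subjects the abstract treats as AI-relevant.
AI_SUBJECTS = {"cs.AI", "cs.LG", "stat.ML", "cs.MA", "cs.CL", "cs.CV", "cs.RO"}

def tokenize(text):
    return text.lower().split()

def train_naive_bayes(records):
    """Count words per subject tag (a stand-in for the real classifier)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in records:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict_subject(model, text):
    """Return the most probable subject tag, using add-one smoothing."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def is_ai_relevant(model, text):
    """A paper counts as AI-relevant if its inferred tag is an AI subject."""
    return predict_subject(model, text) in AI_SUBJECTS

model = train_naive_bayes(TRAIN)
# Inference on text from a different (hypothetical) corpus record.
print(predict_subject(model, "robot motion planning"))
print(is_ai_relevant(model, "galaxy halo mass"))
```

The point of the design is that the labels come for free from authors' own tag choices, so the definition of AI relevance is whatever the trained model learns from those tags, with no expert labeling step.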
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Scientific Computing and Data Management
