Predicting citation impact of research papers using GPT and other text embeddings
Adilson Vital Jr., Filipi N. Silva, Osvaldo N. Oliveira Jr., Diego R. Amancio

TL;DR
This study demonstrates that using GPT-based text embeddings and machine learning, particularly random forests, can predict whether research papers will be among the top 20% most cited within a journal with about 80% accuracy, relying solely on abstract content.
Contribution
The paper introduces a novel approach combining GPT embeddings and machine learning to predict research paper impact based solely on abstract text, without considering author or institutional factors.
Findings
GPT embeddings outperform other text embedding methods.
Random forest achieved about 80% accuracy in predicting whether a paper falls in the journal's top 20% most cited.
TF-IDF performs nearly as well as GPT in impact prediction.
Abstract
The impact of research papers, typically measured in terms of citation counts, depends on several factors, including the reputation of the authors, journals, and institutions, in addition to the quality of the scientific work. In this paper, we present an approach that combines natural language processing and machine learning to predict the impact of papers in a specific journal. Our focus is on the text, which should correlate with impact and the topics covered in the research. We employed a dataset of over 40,000 articles from ACS Applied Materials and Interfaces spanning from 2012 to 2022. The data was processed using various text embedding techniques and classified with supervised machine learning algorithms. Papers were categorized into the top 20% most cited within the journal, using both yearly and cumulative citation counts as metrics. Our analysis reveals that the method…
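The labeling step described in the abstract (marking papers as top 20% most cited under a cumulative or a yearly metric) can be illustrated with a minimal sketch. It is not the authors' code. The toy data, the function name `top_fraction_labels`, the `reference_year`, the tie handling, and the reading of "yearly" as citations per year since publication are all assumptions made for illustration.

```python
import math

def top_fraction_labels(scores, fraction=0.2):
    """Label each paper 1 if its score is in the top `fraction`, else 0.

    Assumption: ties are broken by input order; the source does not say
    how the paper handles ties or rounding of the cutoff.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_top = max(1, math.ceil(fraction * len(ranked)))
    top = set(ranked[:n_top])
    return {paper_id: int(paper_id in top) for paper_id in scores}

# Hypothetical toy data: paper_id -> (publication_year, total_citations).
papers = {
    "p1": (2012, 100), "p2": (2014, 90), "p3": (2016, 30),
    "p4": (2018, 20), "p5": (2019, 10), "p6": (2020, 40),
    "p7": (2021, 30), "p8": (2013, 9), "p9": (2015, 7),
    "p10": (2017, 5),
}
reference_year = 2022  # assumed year in which citations are counted

# Cumulative metric: raw citation totals.
cumulative = {pid: cites for pid, (year, cites) in papers.items()}

# Yearly metric (assumed interpretation): citations per year since publication.
yearly = {
    pid: cites / (reference_year - year)
    for pid, (year, cites) in papers.items()
}

cumulative_labels = top_fraction_labels(cumulative)
yearly_labels = top_fraction_labels(yearly)
```

In this toy set the two metrics select different papers: the cumulative count favors the older `p1` and `p2`, while the per-year rate favors the recent `p6` and `p7`. This is one reason a study might report results under both metrics.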
Taxonomy
Topics: scientometrics and bibliometrics research
