Regression for citation data: An evaluation of different methods
Mike Thelwall, Paul Wilson

TL;DR
This paper evaluates regression methods for citation data using simulated discrete lognormal counts. It shows that adding one to the counts, taking the log, and fitting a linear model is a better strategy than the negative binomial regression traditionally used for citation counts.
Contribution
It introduces and compares effective regression strategies for citation data, highlighting the advantages of log transformation and linear models over negative binomial regression.
Findings
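Adding one to the citation counts, taking the log, and fitting an ordinary least squares model performs better than negative binomial regression on simulated discrete lognormal data.
Discarding zero citation counts, then log-transforming the remaining counts and fitting a linear model, yields reasonable results.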
Using generalized linear models with continuous lognormal data is also effective.
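The second finding (drop zeros, log-transform, fit a linear model) can be illustrated with a minimal sketch. This is not the authors' code: the function name `ols_log_nonzero`, the toy data, and the `years` predictor are all hypothetical, and NumPy's least-squares solver stands in for whatever statistics package the paper used.

```python
import numpy as np

def ols_log_nonzero(citations, X):
    """Drop zero counts, then fit OLS to ln(citations).

    X is an (n, k) predictor matrix without an intercept column.
    Returns the coefficients (intercept first) and the number of rows kept.
    """
    citations = np.asarray(citations)
    X = np.asarray(X, dtype=float)
    keep = citations > 0                      # discard uncited records
    y = np.log(citations[keep])               # log of the remaining counts
    design = np.column_stack([np.ones(keep.sum()), X[keep]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, int(keep.sum())

# Toy data (hypothetical): the nonzero counts double with each unit of `years`,
# so the fitted slope is ln(2) by construction.
citations = np.array([0, 1, 2, 4, 8, 0, 16])
years = np.array([0, 0, 1, 2, 3, 1, 4])
coef, n_used = ols_log_nonzero(citations, years.reshape(-1, 1))
```

Dropping zeros changes the population being modelled, so the coefficients describe cited records only.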
Abstract
Citations are increasingly used for research evaluations. It is therefore important to identify factors affecting citation scores that are unrelated to scholarly quality or usefulness so that these can be taken into account. Regression is the most powerful statistical technique to identify these factors and hence it is important to identify the best regression strategy for citation data. Citation counts tend to follow a discrete lognormal distribution and, in the absence of alternatives, have been investigated with negative binomial regression. Using simulated discrete lognormal data (continuous lognormal data rounded to the nearest integer) this article shows that a better strategy is to add one to the citations, take their log and then use the general linear (ordinary least squares) model for regression (e.g., multiple linear regression, ANOVA), or to use the generalised linear model…
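The main strategy described in the abstract (simulate discrete lognormal data by rounding continuous lognormal values, add one, take the log, fit ordinary least squares) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's simulation: the sample size, the parameter values, the binary `group` factor, and the function name `ols_log1p` are all hypothetical.

```python
import numpy as np

def ols_log1p(citations, X):
    """Fit OLS to ln(1 + citations).

    X is an (n, k) predictor matrix without an intercept column.
    Returns the coefficient vector with the intercept first.
    """
    y = np.log1p(citations)                   # ln(1 + c), defined for c = 0
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 5000                                      # hypothetical sample size
group = rng.integers(0, 2, size=n)            # hypothetical binary factor

# Assumed effect: the factor shifts the log-scale mean by 0.5.
mu = 1.0 + 0.5 * group

# Discrete lognormal: continuous lognormal rounded to the nearest integer.
citations = np.rint(rng.lognormal(mean=mu, sigma=1.0)).astype(int)

coef = ols_log1p(citations, group.reshape(-1, 1))
```

Because of the added one and the rounding, the fitted slope `coef[1]` is on the ln(1 + c) scale and is not expected to equal the 0.5 shift in the underlying continuous log scale exactly.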
